[Pacemaker] pacemaker node stuck offline

Tue Mar 26 00:41:25 UTC 2013

On 2013-03-22 03:39, pacemaker at feystorm.net wrote:
> 
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the nodes
>>> went offline, and I can't see any reason why.
>>>
>>> Attached are the logs from the 2 nodes (the relevant timeframe seems to
>>> be 2013-03-21 between 06:05 and 06:10).
>>> This is on ubuntu 12.04
> 
>> Looks like your non-redundant cluster-communication was interrupted at
>> around that time for whatever reason and your cluster split-brained.
> 
>> Does the drbd-replication use a different network-connection? If yes,
>> why not use it for a redundant ring setup ... and you should use
>> STONITH.
> 
>> I also wonder why you have defined "expected_votes='1'" in your
>> cluster.conf.
> 
>> Regards,
>> Andreas
> But shouldn't it have recovered? The node shows as "OFFLINE", even
> though it's clearly communicating with the rest of the cluster. What is
> the procedure for getting the node back online? Is there anything other
> than bouncing pacemaker?

It looks like the cluster is having trouble merging the two DCs after
the split-brain. Try stopping Pacemaker and cman on i-3307d96b and
clearing the /var/lib/heartbeat/crm directory there, so that the node
starts with an empty configuration and receives the latest updates from
i-a706d8ff.
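
As a rough sketch, those steps could look like the following in shell,
run on i-3307d96b only. The `service` wrapper and the service names
`pacemaker` and `cman` are assumptions for a stock Ubuntu 12.04 install;
the backup copy and the DRY_RUN switch are illustrative additions, not
part of the procedure itself. By default the script only prints what it
would do.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"
CIB_DIR="/var/lib/heartbeat/crm"   # CIB directory named in this thread

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Steps are chained with && so a failing step stops the sequence.
reset_local_cib() {
    # Stop Pacemaker first, then the cman layer underneath it.
    run service pacemaker stop &&
    run service cman stop &&
    # Keep a copy of the old CIB files (assumed precaution).
    run cp -a "$CIB_DIR" "$CIB_DIR.bak" &&
    # Empty the directory but keep the directory itself and its ownership.
    run find "$CIB_DIR" -mindepth 1 -delete &&
    # Bring the stack back up; the CIB should then sync from the other node.
    run service cman start &&
    run service pacemaker start &&
    # One-shot status check.
    run crm_mon -1
}

reset_local_cib
```

Once the printed commands look right for the system, running the script
with DRY_RUN=0 as root executes them.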

> 
> Unfortunately no to the different network connection for drbd. These are
> 2 EC2 instances, so redundant connections aren't available. Though since
> it is EC2, I could set up a STONITH to whack the other instance. The
> only problem here would be a race condition. The EC2 api for shutting
> down or rebooting an instance isn't instantaneous. Both nodes could end
> up sending the signal to reboot the other node.

Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)

> 
> As for expected_votes=1, it's because it's a two-node cluster. Though I
> apparently forgot to set the `two_node` attribute :-(

Those two parameters should not be needed for a cman/Pacemaker cluster;
you can tell Pacemaker to ignore loss of quorum instead.
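
With the crm shell, that would look roughly like the sketch below.
`no-quorum-policy` is the cluster property in question; the DRY_RUN
wrapper is only an illustrative guard, so the script prints the commands
unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set in the environment.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

ignore_quorum_loss() {
    # Tell Pacemaker to keep managing resources when quorum is lost.
    run crm configure property no-quorum-policy=ignore &&
    # Read the value back from the cluster configuration.
    run crm_attribute --type crm_config --name no-quorum-policy --query
}

ignore_quorum_loss
```

Ignoring quorum on a two-node cluster makes working fencing all the more
important, since neither node can tell a dead peer from a dead link.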

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
