[Pacemaker] pacemaker node stuck offline
Andreas Kurz
andreas at hastexo.com
Tue Mar 26 00:41:25 UTC 2013
On 2013-03-22 03:39, pacemaker at feystorm.net wrote:
>
> On 03/21/2013 11:15 AM, Andreas Kurz wrote:
>> On 2013-03-21 14:31, Patrick Hemmer wrote:
>>> I've got a 2-node cluster where it seems last night one of the nodes
>>> went offline, and I can't see any reason why.
>>>
>>> Attached are the logs from the 2 nodes (the relevant timeframe seems to
>>> be 2013-03-21 between 06:05 and 06:10).
>>> This is on ubuntu 12.04
>
>> Looks like your non-redundant cluster-communication was interrupted at
>> around that time for whatever reason and your cluster split-brained.
>
>> Does the drbd-replication use a different network-connection? If yes,
>> why not use it for a redundant ring setup? You should also use
>> STONITH.
>
>> I also wonder why you have defined "expected_votes='1'" in your
>> cluster.conf.
>
>> Regards,
>> Andreas
> But shouldn't it have recovered? The node shows as "OFFLINE", even
> though it's clearly communicating with the rest of the cluster. What is
> the procedure for getting the node back online? Anything other than
> bouncing pacemaker?
Looks like the cluster is having trouble merging the two DCs after the
split-brain. Try stopping cman/Pacemaker on i-3307d96b and cleaning out
the /var/lib/heartbeat/crm directory there, so that it starts with an
empty configuration and receives the latest updates from i-a706d8ff.
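A rough sketch of that procedure as a dry-run helper, in case it is useful. It only prints the commands for review. The init script names ("pacemaker", "cman"), the stop/start ordering and the use of GNU find are assumptions about a typical Ubuntu 12.04 install, not something confirmed in this thread; only the directory and the node name come from the message above.

```python
import shlex

# Path taken from the message above; everything else here is an assumption.
CIB_DIR = "/var/lib/heartbeat/crm"


def rejoin_plan(cib_dir=CIB_DIR):
    """Return the commands, in order, to run on the stale node (i-3307d96b)."""
    quoted_dir = shlex.quote(cib_dir)
    return [
        # Assumed init script names; stop Pacemaker before cman.
        "service pacemaker stop",
        "service cman stop",
        # Remove the stored CIB files but keep the directory itself.
        f"find {quoted_dir} -mindepth 1 -delete",
        # Start in the reverse order so the node rejoins with an empty CIB.
        "service cman start",
        "service pacemaker start",
    ]


if __name__ == "__main__":
    # Dry run: print the plan for review instead of executing anything.
    for command in rejoin_plan():
        print(command)
```

Run it on i-3307d96b only, and copy the directory somewhere first if you want to keep the old CIB for later inspection.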
>
> Unfortunately no to the different network connection for drbd. These are
> 2 EC2 instances, so redundant connections aren't available. Though since
> it is EC2, I could set up a STONITH to whack the other instance. The
> only problem here would be a race condition. The EC2 api for shutting
> down or rebooting an instance isn't instantaneous. Both nodes could end
> up sending the signal to reboot the other node.
Yeah, you would need to add a very generous start-timeout to the monitor
operation of the stonith primitive ... but it works ;-)
>
> As for expected_votes=1, it's because it's a two-node cluster. Though I
> apparently forgot to set the `two_node` attribute :-(
Those two parameters should not be needed for a cman/pacemaker cluster;
you can tell Pacemaker to ignore loss of quorum.
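For reference, the cluster property behind that is no-quorum-policy. A minimal sketch that builds the matching crm shell command, assuming the crm shell is installed on the node; the helper name quorum_policy_command is made up for illustration, and the list of accepted values is the one documented for Pacemaker 1.1.

```python
# Values documented for the no-quorum-policy property in Pacemaker 1.1.
VALID_POLICIES = ("ignore", "freeze", "stop", "suicide")


def quorum_policy_command(policy="ignore"):
    """Build the crm shell invocation that sets no-quorum-policy."""
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown no-quorum-policy: {policy!r}")
    return ["crm", "configure", "property", f"no-quorum-policy={policy}"]


if __name__ == "__main__":
    # Dry run: print the command; pass the list to subprocess.run() to apply it.
    print(" ".join(quorum_policy_command()))
```

Ignoring quorum loss in a two-node cluster is only reasonably safe once STONITH is configured.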
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now