[Pacemaker] Node stuck in pending state

Thu Apr 10 00:04:18 UTC 2014

On 10 Apr 2014, at 4:49 am, Brian J. Murrell <brian at interlinx.bc.ca> wrote:

> On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote: 
>> Looks like your fencing (stonith) failed.
> 
> Where?  If I'm reading the logs correctly, it looks like stonith worked.
> Here's the stonith:
> 
> Apr  8 09:53:21 lotus-4vm6 stonith-ng[2492]:   notice: log_operation: Operation 'reboot' [3306] (call 2 from crmd.2496) for host 'lotus-4vm5' with device 'st-fencing' returned: 0 (OK)
> 
> and then corosync reporting that the node left the cluster:
> 
> Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: lost: lotus-4vm5 3176140298
> 
> Yes?  Or am I misunderstanding that message?
> 
> Doesn't this below also further indicate that the vm5 node did actually
> get stonithed?
> 
> Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node lotus-4vm5 was not seen in the previous transition
> Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 3176140298/lotus-4vm5 is now: lost
> 
> crmd and cib also seem to be noticing the node has gone away too, don't
> they here:
> 
> Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: plugin_handle_membership: Membership 20: quorum lost
> Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was member)
> Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: plugin_handle_membership: Membership 20: quorum lost
> Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was member)
> 
> And then the node comes back:
> 
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 24: memb=1, new=0, lost=0
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: memb: lotus-4vm6 3192917514
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 24: memb=2, new=1, lost=0
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 3176140298/lotus-4vm5 is now: member
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: NEW:  lotus-4vm5 3176140298
> Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: MEMB: lotus-4vm5 3176140298
> 
> And now crmd realizes the node is back:
> 
> Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member (was lost)
> 
> As well as cib:
> 
> Apr  8 09:54:04 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member (was lost)
> 
> And stonith-ng and crmd reports successful reboot:
> 
> Apr  8 09:54:04 lotus-4vm6 stonith-ng[2492]:   notice: remote_op_done: Operation reboot of lotus-4vm5 by lotus-4vm6 for crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/NoHHIFi at public.gmane.org<mailto:crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/NoHHIFi at public.gmane.org>: OK
> Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_callback: Stonith operation 2/13:0:0:f325afae-64b0-4812-a897-70556ab1e806: OK (0)
> Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_notify: Peer lotus-4vm5 was terminated (reboot) by lotus-4vm6 for lotus-4vm6: OK (ref=ae82b411-b07a-4235-be55-5a30a00b323b) by client crmd.2496
> 
> But all of a sudden, crmd reports the node is "lost" again?
> 
> Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was member)
> 
> But why?

Brian: the detective work above is highly appreciated

Little known fact, I too got sick of trying to figure out why nodes were being marked up/down and so the second '${tag}:' is actually the source of the change.
So in this case, its the function send_stonith_update() and I recognise the problem from a few weeks ago.

Essentially the node is returning "too fast" (specifically, before the fencing notification arrives) causing pacemaker to forget the node is up and healthy.

The fix for this is https://github.com/beekhof/pacemaker/commit/e777b17 and is present in 1.1.11

> 
> Not surprising that we get these messages (below) if crmd thinks it was
> suddenly "lost" (when it was in fact not according to the log for vm5:
> )
> 
> Apr  8 09:54:11 lotus-4vm6 crmd[2496]:  warning: crmd_cs_dispatch: Recieving messages from a node we think is dead: lotus-4vm5[-1118826998]
> Apr  8 09:54:31 lotus-4vm6 crmd[2496]:   notice: do_election_count_vote: Election 2 (current: 2, owner: lotus-4vm5): Processed vote from lotus-4vm5 (Peer is not part of our cluster)
> 
> So I think the question is, why did crmd suddenly believe the node to be
> "lost" when there was no evidence that it was lost?
> 
> b.
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 841 bytes
Desc: Message signed with OpenPGP using GPGMail
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140410/b41ab369/attachment-0004.sig>