[Pacemaker] Node stuck in pending state

Brian J. Murrell brian at interlinx.bc.ca
Wed Apr 9 20:49:49 CEST 2014


On Tue, 2014-04-08 at 17:29 -0400, Digimer wrote: 
> Looks like your fencing (stonith) failed.

Where?  If I'm reading the logs correctly, it looks like stonith worked.
Here's the stonith:

Apr  8 09:53:21 lotus-4vm6 stonith-ng[2492]:   notice: log_operation: Operation 'reboot' [3306] (call 2 from crmd.2496) for host 'lotus-4vm5' with device 'st-fencing' returned: 0 (OK)

and then corosync reporting that the node left the cluster:

Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: lost: lotus-4vm5 3176140298

Yes?  Or am I misunderstanding that message?

Doesn't the following also indicate that the vm5 node did actually
get stonithed?

Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node lotus-4vm5 was not seen in the previous transition
Apr  8 09:53:26 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 3176140298/lotus-4vm5 is now: lost

crmd and cib also seem to notice that the node has gone away, don't
they:

Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: plugin_handle_membership: Membership 20: quorum lost
Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was member)
Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: plugin_handle_membership: Membership 20: quorum lost
Apr  8 09:53:26 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was member)

And then the node comes back:

Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 24: memb=1, new=0, lost=0
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: memb: lotus-4vm6 3192917514
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 24: memb=2, new=1, lost=0
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: update_member: Node 3176140298/lotus-4vm5 is now: member
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: NEW:  lotus-4vm5 3176140298
Apr  8 09:54:04 lotus-4vm6 corosync[2442]:   [pcmk  ] info: pcmk_peer_update: MEMB: lotus-4vm5 3176140298

And now crmd realizes the node is back:

Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member (was lost)

As well as cib:

Apr  8 09:54:04 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member (was lost)

And stonith-ng and crmd report a successful reboot:

Apr  8 09:54:04 lotus-4vm6 stonith-ng[2492]:   notice: remote_op_done: Operation reboot of lotus-4vm5 by lotus-4vm6 for crmd.2496-ZBdUr1hrI04s+xCAc1R/N1ez/NoHHIFi at public.gmane.org: OK
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_callback: Stonith operation 2/13:0:0:f325afae-64b0-4812-a897-70556ab1e806: OK (0)
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: tengine_stonith_notify: Peer lotus-4vm5 was terminated (reboot) by lotus-4vm6 for lotus-4vm6: OK (ref=ae82b411-b07a-4235-be55-5a30a00b323b) by client crmd.2496

But all of a sudden, crmd reports the node is "lost" again?

Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was member)

But why?
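(As an aside, the whole member/lost flip-flop is easier to see pulled out of the log in one place. A quick sketch, just grepping the crm_update_peer_state lines quoted above; the sample heredoc stands in for /var/log/messages on vm6:)

```shell
#!/bin/sh
# Sketch: extract only the peer-state transitions for lotus-4vm5 from the
# log lines quoted above, to show the member -> lost -> member -> lost
# sequence in one place.  The heredoc is a stand-in for the real syslog.
log=$(cat <<'EOF'
Apr  8 09:53:26 lotus-4vm6 cib[2491]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now lost (was member)
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: plugin_handle_membership: Node lotus-4vm5[3176140298] - state is now member (was lost)
Apr  8 09:54:04 lotus-4vm6 crmd[2496]:   notice: crm_update_peer_state: send_stonith_update: Node lotus-4vm5[3176140298] - state is now lost (was member)
EOF
)
# Keep only the "state is now X (was Y)" part of each matching line.
printf '%s\n' "$log" | grep -o 'state is now [a-z]* (was [a-z]*)'
```

Run against the real log (`grep crm_update_peer_state /var/log/messages | grep lotus-4vm5`), this makes it obvious that the last transition to "lost" came from send_stonith_update rather than from a membership event.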

Not surprising that we get these messages (below) if crmd thinks the
node was suddenly "lost" (when in fact it was not, according to the log
on vm5):

Apr  8 09:54:11 lotus-4vm6 crmd[2496]:  warning: crmd_cs_dispatch: Recieving messages from a node we think is dead: lotus-4vm5[-1118826998]
Apr  8 09:54:31 lotus-4vm6 crmd[2496]:   notice: do_election_count_vote: Election 2 (current: 2, owner: lotus-4vm5): Processed vote from lotus-4vm5 (Peer is not part of our cluster)

So I think the question is, why did crmd suddenly believe the node to be
"lost" when there was no evidence that it was lost?

b.



More information about the Pacemaker mailing list