[Pacemaker] why so long to stonith?
Brian J. Murrell
brian at interlinx.bc.ca
Tue Apr 23 19:34:40 UTC 2013
Using pacemaker 1.1.8 on RHEL 6.4, I did a test where I simply killed
(-KILL) corosync on a peer node (node1). Pacemaker seemed to take quite
a long time to transition to stonithing it after noticing it was AWOL,
though.
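(For concreteness, the kill was nothing fancier than an uncatchable
SIGKILL sent to corosync on node1; something along the lines of the
following, assuming pkill is available -- the exact invocation isn't
the interesting part:)

  # on node1: send SIGKILL to the corosync process, matched by exact name
  pkill -KILL -x corosync

Here is what node2 logged: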
Apr 23 19:05:20 node2 corosync[1324]: [TOTEM ] A processor failed, forming new configuration.
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: lost: node1 4252674240
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: lost
Apr 23 19:05:21 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 188 to 2 children
Apr 23 19:05:21 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:05:21 node2 corosync[1324]: [CPG ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 23 19:05:21 node2 corosync[1324]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 23 19:05:21 node2 crmd[1634]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 crmd[1634]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:05:21 node2 cib[1629]: notice: ais_dispatch_message: Membership 188: quorum lost
Apr 23 19:05:21 node2 cib[1629]: notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 23 19:08:31 node2 crmd[1634]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Apr 23 19:08:31 node2 pengine[1633]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 23 19:08:31 node2 pengine[1633]: warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
Apr 23 19:08:31 node2 pengine[1633]: warning: determine_online_status: Node node1 is unclean
Apr 23 19:08:31 node2 pengine[1633]: warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
Apr 23 19:08:31 node2 pengine[1633]: warning: stage6: Scheduling Node node1 for STONITH
Apr 23 19:08:31 node2 pengine[1633]: notice: LogActions: Move MGS_e4a31b (Started node1 -> node2)
Apr 23 19:08:31 node2 crmd[1634]: notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=60000)
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
Apr 23 19:08:31 node2 stonith-ng[1630]: notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
Apr 23 19:08:31 node2 pengine[1633]: warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
Apr 23 19:08:41 node2 stonith-ng[1630]: notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634 at node2.fb431eb4: OK
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
Apr 23 19:08:41 node2 crmd[1634]: notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: memb: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: Node 4252674240/node1 is now: member
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: NEW: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: pcmk_peer_update: MEMB: node1 4252674240
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children
Apr 23 19:09:03 node2 corosync[1324]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: update_member: 0x1efc040 Node 4252674240 (node1) born on: 192
Apr 23 19:09:03 node2 corosync[1324]: [pcmk ] info: send_member_notification: Sending membership update 192 to 2 children
As you can see, 3 minutes and 10 seconds went by (19:05:21 to 19:08:31)
before pacemaker transitioned from noticing that the node was
unresponsive to stonithing it. Notably, the do_state_transition line
shows cause=C_TIMER_POPPED / origin=crm_timer_popped, so the transition
appears to have been kicked off by a timer popping rather than by the
membership change itself. This smacks of some kind of mis-configured
timeout, but I'm not aware of any timeout that would have this effect.
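For reference, something along these lines should dump whatever
cluster-wide timers are actually configured, in case one of them jumps
out at somebody (this assumes the stock cibadmin/crm_attribute tools;
cluster-recheck-interval is just one candidate name, not something I
have confirmed is involved):

  # dump the cluster-wide options section of the CIB
  cibadmin -Q -o crm_config

  # or query a single property by name
  crm_attribute -t crm_config -n cluster-recheck-interval -G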
Thoughts?
b.