[Pacemaker] why so long to stonith?

Digimer lists at alteeve.ca
Wed Apr 24 00:45:19 UTC 2013


As I understand it, this is a known issue with the 1.1.8 release. I 
believe 1.1.9 is now available from the Pacemaker repos, and it should 
fix the problem.
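
If you want to double-check before and after, something like the 
following should do it on RHEL 6 (a minimal sketch; it assumes the repo 
you installed pacemaker from already carries the 1.1.9 packages):

  # Show what is currently installed
  rpm -q pacemaker corosync

  # Pull in the newer pacemaker from the configured repos
  yum update pacemaker

  # Confirm the running daemon now reports 1.1.9
  pacemakerd --version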

digimer

On 04/23/2013 03:34 PM, Brian J. Murrell wrote:
> Using pacemaker 1.1.8 on RHEL 6.4, I ran a test where I simply killed
> (with -KILL) corosync on a peer node.  Pacemaker seemed to take a long
> time, though, to transition from noticing it was AWOL to stonithing it:
>
> Apr 23 19:05:20 node2 corosync[1324]:   [TOTEM ] A processor failed, forming new configuration.
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 188: memb=1, new=0, lost=1
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: node2 2608507072
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: lost: node1 4252674240
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 188: memb=1, new=0, lost=0
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: node2 2608507072
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 4252674240/node1 is now: lost
> Apr 23 19:05:21 node2 corosync[1324]:   [pcmk  ] info: send_member_notification: Sending membership update 188 to 2 children
> Apr 23 19:05:21 node2 corosync[1324]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Apr 23 19:05:21 node2 corosync[1324]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
> Apr 23 19:05:21 node2 corosync[1324]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Apr 23 19:05:21 node2 crmd[1634]:   notice: ais_dispatch_message: Membership 188: quorum lost
> Apr 23 19:05:21 node2 crmd[1634]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
> Apr 23 19:05:21 node2 cib[1629]:   notice: ais_dispatch_message: Membership 188: quorum lost
> Apr 23 19:05:21 node2 cib[1629]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
> Apr 23 19:08:31 node2 crmd[1634]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
> Apr 23 19:08:31 node2 pengine[1633]:   notice: unpack_config: On loss of CCM Quorum: Ignore
> Apr 23 19:08:31 node2 pengine[1633]:  warning: pe_fence_node: Node node1 will be fenced because the node is no longer part of the cluster
> Apr 23 19:08:31 node2 pengine[1633]:  warning: determine_online_status: Node node1 is unclean
> Apr 23 19:08:31 node2 pengine[1633]:  warning: custom_action: Action MGS_e4a31b_stop_0 on node1 is unrunnable (offline)
> Apr 23 19:08:31 node2 pengine[1633]:  warning: stage6: Scheduling Node node1 for STONITH
> Apr 23 19:08:31 node2 pengine[1633]:   notice: LogActions: Move    MGS_e4a31b (Started node1 -> node2)
> Apr 23 19:08:31 node2 crmd[1634]:   notice: te_fence_node: Executing reboot fencing operation (15) on node1 (timeout=60000)
> Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: handle_request: Client crmd.1634.642b9c6e wants to fence (reboot) 'node1' with device '(any)'
> Apr 23 19:08:31 node2 stonith-ng[1630]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: fb431eb4-789c-41bc-903e-4041d50e93b4 (0)
> Apr 23 19:08:31 node2 pengine[1633]:  warning: process_pe_message: Calculated Transition 115: /var/lib/pacemaker/pengine/pe-warn-7.bz2
> Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: log_operation: Operation 'reboot' [27682] (call 0 from crmd.1634) for host 'node1' with device 'st-node1' returned: 0 (OK)
> Apr 23 19:08:41 node2 stonith-ng[1630]:   notice: remote_op_done: Operation reboot of node1 by node2 for crmd.1634 at node2.fb431eb4: OK
> Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_callback: Stonith operation 3/15:115:0:c118573c-84e3-48bd-8dc9-40de24438385: OK (0)
> Apr 23 19:08:41 node2 crmd[1634]:   notice: tengine_stonith_notify: Peer node1 was terminated (st_notify_fence) by node2 for node2: OK (ref=fb431eb4-789c-41bc-903e-4041d50e93b4) by client crmd.1634
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 192: memb=1, new=0, lost=0
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: memb: node2 2608507072
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 192: memb=2, new=1, lost=0
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: Node 4252674240/node1 is now: member
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: NEW:  node1 4252674240
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: node2 2608507072
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: pcmk_peer_update: MEMB: node1 4252674240
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: send_member_notification: Sending membership update 192 to 2 children
> Apr 23 19:09:03 node2 corosync[1324]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: update_member: 0x1efc040 Node 4252674240 (node1) born on: 192
> Apr 23 19:09:03 node2 corosync[1324]:   [pcmk  ] info: send_member_notification: Sending membership update 192 to 2 children
>
> As you can see, 3 minutes and 10 seconds (19:05:21 -> 19:08:31) went by
> before pacemaker transitioned from noticing the node was unresponsive
> to stonithing it.
>
> This smacks of some kind of misconfigured timeout, but I'm not aware
> of any timeout that would have this effect.
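
For what it's worth, the do_state_transition line at 19:08:31 above shows 
cause=C_TIMER_POPPED, i.e. it was a timer popping, not the membership 
change itself, that finally woke the policy engine, which is consistent 
with the known 1.1.8 issue. If you want to rule out a configured timer, 
you can query the relevant cluster properties; a quick inspection sketch 
(standard pacemaker 1.1 properties, shown purely as an illustration):

  # The periodic timer that can wake the policy engine when nothing
  # else does (defaults to 15min in pacemaker 1.1)
  crm_attribute --type crm_config --name cluster-recheck-interval --query

  # The global stonith action timeout, if one has been set
  crm_attribute --type crm_config --name stonith-timeout --query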
>
> Thoughts?
> b.


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



