[Pacemaker] timeout rebooting with stonith_sbd
Alexandr A. Alexandrov
shurrman at gmail.com
Tue May 14 14:49:40 UTC 2013
Hi!
I have a two-node cluster (virtual machines) with several resources and
shared storage.
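For reference, the quorum and fencing parts of the configuration look roughly
like this (trimmed and paraphrased; the SBD device path is only a placeholder):

corosync.conf:
    quorum {
        provider: corosync_votequorum
        expected_votes: 2
    }

CIB (crm shell syntax):
    # external/sbd fences by writing a reset request to the shared disk
    primitive stonith_sbd stonith:external/sbd \
        params sbd_device="/dev/disk/by-id/<shared-disk>"
    property stonith-enabled="true" \
        no-quorum-policy="ignore"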
When connectivity is lost (for a reason that still needs to be
debugged), here is what I get (I am skipping unrelated messages):
May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] The token was lost in
the OPERATIONAL state.
May 14 16:49:21 wcs2 corosync[27531]: [TOTEM ] A processor failed,
forming new configuration.
Why is corosync connectivity lost? There was nothing suspicious in the
logs at all.
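In case it matters, the effective totem token timeout can be checked with
something like this (assuming corosync 2.x):

    # show the runtime token timeout (milliseconds)
    corosync-cmapctl | grep totem.token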
May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269211 state=2,
votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [VOTEQ ] node 739269212 state=1,
votes=1, expected=2
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] This node is within the
non-primary component and will NOT provide any services.
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] Members[1]: 739269212
May 14 16:49:36 wcs2 corosync[27531]: [QUORUM] sending quorum
notification to (nil), length = 52
May 14 16:49:36 wcs2 crmd[11381]: warning: match_down_event: No match
for shutdown action on 739269211
May 14 16:49:36 wcs2 crmd[11381]: notice: peer_update_callback:
Stonith/shutdown of wcs1 not matched
What does that warning mean?
May 14 16:49:37 wcs2 pengine[27574]: notice: unpack_config: On loss of
CCM Quorum: Ignore
May 14 16:49:37 wcs2 pengine[27574]: warning: pe_fence_node: Node wcs1
will be fenced because stonith_sbd is thought to be active there
May 14 16:49:37 wcs2 pengine[27574]: warning: custom_action: Action
stonith_sbd_stop_0 on wcs1 is unrunnable (offline)
May 14 16:49:37 wcs2 pengine[27574]: warning: stage6: Scheduling Node
wcs1 for STONITH
May 14 16:49:37 wcs2 pengine[27574]: notice: LogActions: Move
stonith_sbd#011(Started wcs1 -> wcs2)
All resources were active on node wcs2 (the surviving node); stonith_sbd was
active on node wcs1.
May 14 16:49:37 wcs2 crmd[11381]: notice: te_fence_node: Executing
reboot fencing operation (38) on wcs1 (timeout=60000)
May 14 16:49:37 wcs2 stonith-ng[27571]: notice: handle_request: Client
crmd.11381.a02439c4 wants to fence (reboot) 'wcs1' with device '(any)'
May 14 16:49:37 wcs2 stonith-ng[27571]: notice:
initiate_remote_stonith_op: Initiating remote operation reboot for wcs1:
37151815-2182-42fa-b32e-86288b18085b (0)
Now, as these are actually virtual machines, the reboot takes place quite
quickly:
May 14 16:49:46 wcs2 crmd[11381]: notice: pcmk_quorum_notification:
Membership 1000: quorum acquired (2)
May 14 16:49:46 wcs2 crmd[11381]: notice: crm_update_peer_state:
pcmk_quorum_notification: Node wcs1[739269211] - state is now member
May 14 16:50:05 wcs2 crmd[11381]: notice: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_local_callback:
Sending full refresh (origin=crmd)
May 14 16:50:07 wcs2 attrd[27573]: notice: attrd_trigger_update:
Sending flush op to all hosts for: probe_complete (true)
May 14 16:50:49 wcs2 stonith-ng[27571]: error: remote_op_done:
Operation reboot of wcs1 by wcs2 for crmd.11381 at wcs2.37151815: Timer expired
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback:
Stonith operation 11/38:2655:0:8f1636b7-dd1d-470c-b645-65a9c8743a69:
Timer expired (-62)
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_callback:
Stonith operation 11 for wcs1 failed (Timer expired): aborting transition.
May 14 16:50:49 wcs2 crmd[11381]: notice: tengine_stonith_notify: Peer
wcs1 was not terminated (st_notify_fence) by wcs2 for wcs2: Timer
expired (ref=37151815-2182-42fa-b32e-86288b18085b) by client crmd.11381
But why does the reboot operation timer expire?
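My understanding is that the stonith timeout (60000 ms above) has to be longer
than the msgwait timeout stored in the SBD device header, so the next thing I
will compare is roughly this (device path is only a placeholder):

    # on-disk SBD timeouts, in particular "Timeout (msgwait)"
    sbd -d /dev/disk/by-id/<shared-disk> dump

    # cluster-wide stonith timeout, if set explicitly
    crm_attribute --type crm_config --name stonith-timeout --query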