[Pacemaker] will a stonith resource be moved from an AWOL node?

Brian J. Murrell brian at interlinx.bc.ca
Tue Apr 30 10:55:41 EDT 2013


I'm using pacemaker 1.1.8 and I don't see stonith resources moving away
from AWOL hosts the way I thought they did with 1.1.7.  So I guess the
first thing to do is clear up what is supposed to happen.

If I have a single stonith resource for a cluster and it's running on
node A and then node A goes AWOL, what happens to that stonith resource?

From what I think I know of pacemaker, it wants to be able to stonith
the AWOL node before moving any resources away from it, since starting
a resource on a new node while the state of the AWOL node is unknown
would be unsafe, right?

But of course, if the resource that pacemaker wants to move is the
stonith resource itself, there's a bit of a catch-22: it can't move the
stonith resource until it has stonithed the node, but it can't stonith
the node because the only node running the stonith resource is AWOL.

So, is pacemaker supposed to resolve this on its own, or am I supposed
to create a cluster configuration that ensures enough stonith resources
exist to avoid this situation?
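
(If the answer is the latter, I imagine something like one fencing
device per node, with a location constraint keeping each device off the
node it fences, would avoid the catch-22.  This is just a sketch of what
I have in mind -- the resource names and fence_xvm options are guesses
for my setup:

# pcs stonith create stonith-node1 fence_xvm pcmk_host_list="node1"
# pcs stonith create stonith-node2 fence_xvm pcmk_host_list="node2"
# pcs constraint location stonith-node1 avoids node1
# pcs constraint location stonith-node2 avoids node2

That way node2 would always be running a device it can use against
node1, and vice versa.  But I'd still like to know whether that is
actually required, or whether pacemaker is supposed to handle the
single-device case on its own.)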

The case I have in hand is this:

# pcs config
Corosync Nodes:
 
Pacemaker Nodes:
 node1 node2 

Resources: 
 Resource: stonith (type=fence_xvm class=stonith)

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 dc-version: 1.1.8-7.wc1.el6-394e906
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 symmetric-cluster: true
 cluster-infrastructure: classic openais (with plugin)
 stonith-enabled: true
 last-lrm-refresh: 1367331233

# pcs status
Last updated: Tue Apr 30 14:48:06 2013
Last change: Tue Apr 30 14:13:53 2013 via crmd on node2
Stack: classic openais (with plugin)
Current DC: node2 - partition WITHOUT quorum
Version: 1.1.8-7.wc1.el6-394e906
2 Nodes configured, 2 expected votes
1 Resources configured.


Node node1: UNCLEAN (pending)
Online: [ node2 ]

Full list of resources:

 stonith	(stonith:fence_xvm):	Started node1

node1 is very clearly and completely off.  The cluster has been in this state, with node1 off, for several tens of minutes now, and the stonith resource is still reported as running on it.
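
To rule out the fence device itself, I can exercise it by hand from
node2 -- something along these lines (the exact options depend on how
fence_virtd is configured, so treat this as a sketch):

# fence_xvm -o list
# fence_xvm -H node1 -o status

But even if the device works when driven manually, pacemaker apparently
won't use it while it believes the stonith resource is still started on
node1.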

The log, since corosync noticed node1 going AWOL:

Apr 30 14:14:56 node2 corosync[1364]:   [TOTEM ] A processor failed, forming new configuration.
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 52: memb=1, new=0, lost=1
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: memb: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: lost: node1 4252674240
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 52: memb=1, new=0, lost=0
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: pcmk_peer_update: MEMB: node2 2608507072
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node node1 was not seen in the previous transition
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: update_member: Node 4252674240/node1 is now: lost
Apr 30 14:14:57 node2 corosync[1364]:   [pcmk  ] info: send_member_notification: Sending membership update 52 to 2 children
Apr 30 14:14:57 node2 corosync[1364]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 30 14:14:57 node2 corosync[1364]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.122.155) ; members(old:2 left:1)
Apr 30 14:14:57 node2 corosync[1364]:   [MAIN  ] Completed service synchronization, ready to provide service.
Apr 30 14:14:57 node2 crmd[1666]:   notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:  warning: match_down_event: No match for shutdown action on node1
Apr 30 14:14:57 node2 crmd[1666]:   notice: peer_update_callback: Stonith/shutdown of node1 not matched
Apr 30 14:14:57 node2 cib[1661]:   notice: ais_dispatch_message: Membership 52: quorum lost
Apr 30 14:14:57 node2 cib[1661]:   notice: crm_update_peer_state: crm_update_ais_node: Node node1[4252674240] - state is now lost
Apr 30 14:14:57 node2 crmd[1666]:   notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 30 14:14:57 node2 attrd[1664]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 30 14:14:58 node2 pengine[1665]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:14:58 node2 pengine[1665]:  warning: pe_fence_node: Node node1 will be fenced because stonith is thought to be active there
Apr 30 14:14:58 node2 pengine[1665]:  warning: custom_action: Action stonith_stop_0 on node1 is unrunnable (offline)
Apr 30 14:14:58 node2 pengine[1665]:  warning: stage6: Scheduling Node node1 for STONITH
Apr 30 14:14:58 node2 pengine[1665]:   notice: LogActions: Move    stonith#011(Started node1 -> node2)
Apr 30 14:14:58 node2 crmd[1666]:   notice: te_fence_node: Executing reboot fencing operation (7) on node1 (timeout=60000)
Apr 30 14:14:58 node2 stonith-ng[1662]:   notice: handle_request: Client crmd.1666.82b6657e wants to fence (reboot) 'node1' with device '(any)'
Apr 30 14:14:58 node2 stonith-ng[1662]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: a6371862-ac20-4cc8-a0a5-ece88528817b (0)
Apr 30 14:14:58 node2 pengine[1665]:  warning: process_pe_message: Calculated Transition 99: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Apr 30 14:16:10 node2 stonith-ng[1662]:    error: remote_op_done: Operation reboot of node1 by node2 for crmd.1666@node2.a6371862: Timer expired
Apr 30 14:16:10 node2 crmd[1666]:   notice: tengine_stonith_callback: Stonith operation 11/7:99:0:851e4835-5df1-4210-aeef-43e0c4f07947: Timer expired (-62)
Apr 30 14:16:10 node2 crmd[1666]:   notice: tengine_stonith_callback: Stonith operation 11 for node1 failed (Timer expired): aborting transition.
Apr 30 14:16:10 node2 crmd[1666]:   notice: tengine_stonith_notify: Peer node1 was not terminated (st_notify_fence) by node2 for node2: Timer expired (ref=a6371862-ac20-4cc8-a0a5-ece88528817b) by client crmd.1666
Apr 30 14:16:10 node2 crmd[1666]:   notice: run_graph: Transition 99 (Complete=1, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Apr 30 14:16:10 node2 pengine[1665]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:     crit: get_timet_now: Defaulting to 'now'
Apr 30 14:16:10 node2 pengine[1665]:  warning: pe_fence_node: Node node1 will be fenced because stonith is thought to be active there
Apr 30 14:16:10 node2 pengine[1665]:  warning: custom_action: Action stonith_stop_0 on node1 is unrunnable (offline)
Apr 30 14:16:10 node2 pengine[1665]:  warning: stage6: Scheduling Node node1 for STONITH
Apr 30 14:16:10 node2 pengine[1665]:   notice: LogActions: Move    stonith#011(Started node1 -> node2)
Apr 30 14:16:10 node2 crmd[1666]:   notice: te_fence_node: Executing reboot fencing operation (7) on node1 (timeout=60000)
Apr 30 14:16:10 node2 stonith-ng[1662]:   notice: handle_request: Client crmd.1666.82b6657e wants to fence (reboot) 'node1' with device '(any)'
Apr 30 14:16:10 node2 stonith-ng[1662]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for node1: 2afefecc-f393-46ab-ba79-7f4968501012 (0)
Apr 30 14:16:10 node2 pengine[1665]:  warning: process_pe_message: Calculated Transition 100: (null)
Apr 30 14:17:22 node2 stonith-ng[1662]:    error: remote_op_done: Operation reboot of node1 by node2 for crmd.1666@node2.2afefecc: Timer expired
Apr 30 14:17:22 node2 crmd[1666]:   notice: tengine_stonith_callback: Stonith operation 12/7:100:0:851e4835-5df1-4210-aeef-43e0c4f07947: Timer expired (-62)
Apr 30 14:17:22 node2 crmd[1666]:   notice: tengine_stonith_callback: Stonith operation 12 for node1 failed (Timer expired): aborting transition.
Apr 30 14:17:22 node2 crmd[1666]:   notice: tengine_stonith_notify: Peer node1 was not terminated (st_notify_fence) by node2 for node2: Timer expired (ref=2afefecc-f393-46ab-ba79-7f4968501012) by client crmd.1666
Apr 30 14:17:22 node2 crmd[1666]:   notice: run_graph: Transition 100 (Complete=1, Pending=0, Fired=0, Skipped=4, Incomplete=0, Source=unknown): Stopped
Apr 30 14:17:22 node2 crmd[1666]:   notice: too_many_st_failures: Too many failures to fence node1 (11), giving up
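
I know I can probably unwedge things by hand -- presumably something
like:

# stonith_admin --confirm node1

to tell stonith-ng that node1 really is safely down (I can see that its
VM is off), after which the resources should be recoverable.  But
surely I shouldn't have to do that manually every time a node dies?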

Cheers,
b.
