[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

David Vossel dvossel at redhat.com
Tue Dec 3 18:26:51 EST 2013


----- Original Message -----
> From: "Brian J. Murrell" <brian at interlinx.bc.ca>
> To: pacemaker at clusterlabs.org
> Sent: Monday, December 2, 2013 2:50:41 PM
> Subject: [Pacemaker] catch-22: can't fence node A because node A has the fencing resource
> 
> So, I'm migrating my working pacemaker configuration from 1.1.7 to
> 1.1.10 and am finding what appears to be a new behavior in 1.1.10.
> 
> If a given node is running a fencing resource and that node goes AWOL,
> it needs to be fenced (of course).  But before any other node will take
> over the fencing resource and fence it, the cluster appears to want the
> current owner of the fencing resource to fence the node first.  Of course
> that can't happen, since the node that would do the fencing is AWOL.
> 
> So while I can buy into the general policy that a node needs to be
> fenced before its resources can be taken over, fencing resources have to
> be exempted from this, or you end up with this catch-22.

We did away with all of the policy engine logic that tried to move fencing devices off of the target node before executing the fencing action. Behind the scenes, all fencing devices are now essentially clones: if the node to be fenced has a fencing device running on it, that device can be executed from anywhere else in the cluster, which avoids the "suicide" situation.
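
If you want to sanity-check that from the command line, stonith_admin can show you which devices stonith-ng thinks can fence a given target, and can trigger a fence manually. Something like this, run from host2 with the hostnames from your config, should work if I have the options right:

  # list the devices that can be used to fence host1
  stonith_admin --list host1

  # manually request a reboot of host1 through stonith-ng
  stonith_admin --reboot host1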

When crm_mon output shows a fencing device running on a specific node, all that really means is that we will attempt to execute fencing actions for that device from that node first. If that node is unavailable, we'll try the same device from anywhere else in the cluster we can get it to work (unless you've specifically built a location constraint that prevents the fencing device from ever running on a particular node).
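
Your config below has no constraints on st-fencing, so that caveat doesn't apply to you. Just for illustration, the kind of constraint that would get in the way would look something like:

  # would prevent host2 from ever running (and therefore executing) the device
  pcs constraint location st-fencing avoids host2=INFINITY

As long as nothing like that exists, host2 is free to use st-fencing against host1 even though crm_mon showed the device "running" on host1.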

Hope that helps.

-- Vossel

> 
> I believe that is how things were working in 1.1.7 but now that I'm on
> 1.1.10[-1.el6_4.4] this no longer seems to be the case.
> 
> Or perhaps there is some additional configuration that 1.1.10 needs to
> effect this behavior.  Here is my configuration:
> 
> Cluster Name:
> Corosync Nodes:
>  
> Pacemaker Nodes:
>  host1 host2
> 
> Resources:
>  Resource: rsc1 (class=ocf provider=foo type=Target)
>   Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d
>   Meta Attrs: target-role=Started
>   Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
>               start interval=0 timeout=300 (rsc1-start-0)
>               stop interval=0 timeout=300 (rsc1-stop-0)
>  Resource: rsc2 (class=ocf provider=chroma type=Target)
>   Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515
>   Meta Attrs: target-role=Started
>   Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
>               start interval=0 timeout=300 (rsc2-start-0)
>               stop interval=0 timeout=300 (rsc2-stop-0)
> 
> Stonith Devices:
>  Resource: st-fencing (class=stonith type=fence_foo)
> Fencing Levels:
> 
> Location Constraints:
>   Resource: rsc1
>     Enabled on: host1 (score:20) (id:rsc1-primary)
>     Enabled on: host2 (score:10) (id:rsc1-secondary)
>   Resource: rsc2
>     Enabled on: host2 (score:20) (id:rsc2-primary)
>     Enabled on: host1 (score:10) (id:rsc2-secondary)
> Ordering Constraints:
> Colocation Constraints:
> 
> Cluster Properties:
>  cluster-infrastructure: classic openais (with plugin)
>  dc-version: 1.1.10-1.el6_4.4-368c726
>  expected-quorum-votes: 2
>  no-quorum-policy: ignore
>  stonith-enabled: true
>  symmetric-cluster: true
> 
> One thing that pcs is not showing that might be relevant here is that I
> have a resource-stickiness value set to 1000 to prevent resources from
> failing back to nodes after a failover.
> 
> With the above configuration, if host1 is shut down, host2 just spins in
> a loop doing:
> 
> Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
> be fenced because the node is no longer part of the cluster
> Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node
> host1 is unclean
> Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
> st-fencing_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action
> rsc1_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
> for STONITH
> Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
> st-fencing#011(Started host1 -> host2)
> Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move
> rsc1#011(Started host1 -> host2)
> Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
> fencing operation (13) on host1 (timeout=60000)
> Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client
> crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
> Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for host1:
> ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
> Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated
> Transition 22: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec  2 20:01:14 host2 stonith-ng[8920]:    error: remote_op_done: Operation
> reboot of host1 by host2 for crmd.8924@host2.ad69ead5: Timer expired
> Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 4/13:22:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired
> (-62)
> Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 4 for host1 failed (Timer expired): aborting transition.
> Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer
> host1 was not terminated (reboot) by host2 for host2: Timer expired
> (ref=ad69ead5-0bbb-45d8-bb07-30bcd405ace2) by client crmd.8924
> Dec  2 20:01:14 host2 crmd[8924]:   notice: run_graph: Transition 22
> (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Dec  2 20:01:14 host2 pengine[8923]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Dec  2 20:01:14 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
> be fenced because the node is no longer part of the cluster
> Dec  2 20:01:14 host2 pengine[8923]:  warning: determine_online_status: Node
> host1 is unclean
> Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action
> st-fencing_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action
> rsc1_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:01:14 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
> for STONITH
> Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move
> st-fencing#011(Started host1 -> host2)
> Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move
> rsc1#011(Started host1 -> host2)
> Dec  2 20:01:14 host2 pengine[8923]:  warning: process_pe_message: Calculated
> Transition 23: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec  2 20:01:14 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
> fencing operation (13) on host1 (timeout=60000)
> Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: handle_request: Client
> crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
> Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for host1:
> 4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
> Dec  2 20:02:26 host2 stonith-ng[8920]:    error: remote_op_done: Operation
> reboot of host1 by host2 for crmd.8924@host2.4c3f947b: Timer expired
> Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 5/13:23:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired
> (-62)
> Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 5 for host1 failed (Timer expired): aborting transition.
> Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer
> host1 was not terminated (reboot) by host2 for host2: Timer expired
> (ref=4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6) by client crmd.8924
> Dec  2 20:02:26 host2 crmd[8924]:   notice: run_graph: Transition 23
> (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Dec  2 20:02:26 host2 pengine[8923]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Dec  2 20:02:26 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
> be fenced because the node is no longer part of the cluster
> Dec  2 20:02:26 host2 pengine[8923]:  warning: determine_online_status: Node
> host1 is unclean
> Dec  2 20:02:26 host2 pengine[8923]:  warning: custom_action: Action
> st-fencing_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:02:26 host2 pengine[8923]:  warning: custom_action: Action
> rsc1_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:02:26 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
> for STONITH
> Dec  2 20:02:26 host2 pengine[8923]:   notice: LogActions: Move
> st-fencing#011(Started host1 -> host2)
> Dec  2 20:02:26 host2 pengine[8923]:   notice: LogActions: Move
> rsc1#011(Started host1 -> host2)
> Dec  2 20:02:26 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
> fencing operation (13) on host1 (timeout=60000)
> Dec  2 20:02:26 host2 stonith-ng[8920]:   notice: handle_request: Client
> crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
> Dec  2 20:02:26 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for host1:
> 4b9c1ffc-3029-4b6a-8128-63c05f0ef8de (0)
> Dec  2 20:02:26 host2 pengine[8923]:  warning: process_pe_message: Calculated
> Transition 24: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec  2 20:03:38 host2 stonith-ng[8920]:    error: remote_op_done: Operation
> reboot of host1 by host2 for crmd.8924@host2.4b9c1ffc: Timer expired
> Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 6/13:24:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired
> (-62)
> Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 6 for host1 failed (Timer expired): aborting transition.
> Dec  2 20:03:38 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer
> host1 was not terminated (reboot) by host2 for host2: Timer expired
> (ref=4b9c1ffc-3029-4b6a-8128-63c05f0ef8de) by client crmd.8924
> Dec  2 20:03:38 host2 crmd[8924]:   notice: run_graph: Transition 24
> (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Dec  2 20:03:38 host2 pengine[8923]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Dec  2 20:03:38 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
> be fenced because the node is no longer part of the cluster
> Dec  2 20:03:38 host2 pengine[8923]:  warning: determine_online_status: Node
> host1 is unclean
> Dec  2 20:03:38 host2 pengine[8923]:  warning: custom_action: Action
> st-fencing_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:03:38 host2 pengine[8923]:  warning: custom_action: Action
> rsc1_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:03:38 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
> for STONITH
> Dec  2 20:03:38 host2 pengine[8923]:   notice: LogActions: Move
> st-fencing#011(Started host1 -> host2)
> Dec  2 20:03:38 host2 pengine[8923]:   notice: LogActions: Move
> rsc1#011(Started host1 -> host2)
> Dec  2 20:03:38 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
> fencing operation (13) on host1 (timeout=60000)
> Dec  2 20:03:38 host2 stonith-ng[8920]:   notice: handle_request: Client
> crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
> Dec  2 20:03:38 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for host1:
> 8200c15c-d138-4b0a-b6df-ac6fe6e46ef1 (0)
> Dec  2 20:03:38 host2 pengine[8923]:  warning: process_pe_message: Calculated
> Transition 25: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec  2 20:04:50 host2 stonith-ng[8920]:    error: remote_op_done: Operation
> reboot of host1 by host2 for crmd.8924@host2.8200c15c: Timer expired
> Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 7/13:25:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired
> (-62)
> Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith
> operation 7 for host1 failed (Timer expired): aborting transition.
> Dec  2 20:04:50 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer
> host1 was not terminated (reboot) by host2 for host2: Timer expired
> (ref=8200c15c-d138-4b0a-b6df-ac6fe6e46ef1) by client crmd.8924
> Dec  2 20:04:50 host2 crmd[8924]:   notice: run_graph: Transition 25
> (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
> Dec  2 20:04:50 host2 pengine[8923]:   notice: unpack_config: On loss of CCM
> Quorum: Ignore
> Dec  2 20:04:50 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will
> be fenced because the node is no longer part of the cluster
> Dec  2 20:04:50 host2 pengine[8923]:  warning: determine_online_status: Node
> host1 is unclean
> Dec  2 20:04:50 host2 pengine[8923]:  warning: custom_action: Action
> st-fencing_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:04:50 host2 pengine[8923]:  warning: custom_action: Action
> rsc1_stop_0 on host1 is unrunnable (offline)
> Dec  2 20:04:50 host2 pengine[8923]:  warning: stage6: Scheduling Node host1
> for STONITH
> Dec  2 20:04:50 host2 pengine[8923]:   notice: LogActions: Move
> st-fencing#011(Started host1 -> host2)
> Dec  2 20:04:50 host2 pengine[8923]:   notice: LogActions: Move
> rsc1#011(Started host1 -> host2)
> Dec  2 20:04:50 host2 pengine[8923]:  warning: process_pe_message: Calculated
> Transition 26: /var/lib/pacemaker/pengine/pe-warn-2.bz2
> Dec  2 20:04:50 host2 crmd[8924]:   notice: te_fence_node: Executing reboot
> fencing operation (13) on host1 (timeout=60000)
> Dec  2 20:04:50 host2 stonith-ng[8920]:   notice: handle_request: Client
> crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
> Dec  2 20:04:50 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op:
> Initiating remote operation reboot for host1:
> 8ceabae8-6876-4d6d-b44c-c64c0863f68c (0)
> 
> So is there something new about 1.1.10 that I am missing?
> 
> Cheers,
> b.
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



