[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

Brian J. Murrell brian at interlinx.bc.ca
Mon Dec 2 15:50:41 EST 2013


So, I'm migrating my working Pacemaker configuration from 1.1.7 to
1.1.10 and am finding what appears to be a new behavior in 1.1.10.

If a given node is running a fencing resource and that node goes AWOL,
it needs to be fenced (of course).  But any other node trying to take
over the fencing resource to fence it appears to first want the current
owner of the fencing resource to fence the node.  Of course that can't
happen since the node that needs to do the fencing is AWOL.

So while I can buy into the general policy that a node needs to be
fenced before another node takes over its resources, fencing resources
have to be excepted from this, or you end up with exactly this catch-22.

I believe that is how things worked in 1.1.7, but now that I'm on
1.1.10[-1.el6_4.4] this no longer seems to be the case.

Or perhaps there is some additional configuration that 1.1.10 needs to
effect this behavior.  Here is my configuration:

Cluster Name: 
Corosync Nodes:
 
Pacemaker Nodes:
 host1 host2 

Resources: 
 Resource: rsc1 (class=ocf provider=foo type=Target)
  Attributes: target=111bad0a-a86a-40e3-b056-c5c93168aa0d 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc1-monitor-5)
              start interval=0 timeout=300 (rsc1-start-0)
              stop interval=0 timeout=300 (rsc1-stop-0)
 Resource: rsc2 (class=ocf provider=chroma type=Target)
  Attributes: target=a8efa349-4c73-4efc-90d3-d6be7d73c515 
  Meta Attrs: target-role=Started 
  Operations: monitor interval=5 timeout=60 (rsc2-monitor-5)
              start interval=0 timeout=300 (rsc2-start-0)
              stop interval=0 timeout=300 (rsc2-stop-0)

Stonith Devices: 
 Resource: st-fencing (class=stonith type=fence_foo)
Fencing Levels: 

Location Constraints:
  Resource: rsc1
    Enabled on: host1 (score:20) (id:rsc1-primary)
    Enabled on: host2 (score:10) (id:rsc1-secondary)
  Resource: rsc2
    Enabled on: host2 (score:20) (id:rsc2-primary)
    Enabled on: host1 (score:10) (id:rsc2-secondary)
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: classic openais (with plugin)
 dc-version: 1.1.10-1.el6_4.4-368c726
 expected-quorum-votes: 2
 no-quorum-policy: ignore
 stonith-enabled: true
 symmetric-cluster: true
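
For anyone wanting to reproduce this, the above should correspond to roughly
the following pcs commands (just a sketch: the fence_foo and Target agents are
site-specific, their parameters are abbreviated here, and the remaining cluster
properties are left at their defaults):

  pcs stonith create st-fencing fence_foo
  pcs resource create rsc1 ocf:foo:Target target=111bad0a-a86a-40e3-b056-c5c93168aa0d \
      op monitor interval=5 timeout=60 op start interval=0 timeout=300 \
      op stop interval=0 timeout=300
  pcs resource create rsc2 ocf:chroma:Target target=a8efa349-4c73-4efc-90d3-d6be7d73c515 \
      op monitor interval=5 timeout=60 op start interval=0 timeout=300 \
      op stop interval=0 timeout=300
  pcs constraint location rsc1 prefers host1=20
  pcs constraint location rsc1 prefers host2=10
  pcs constraint location rsc2 prefers host2=20
  pcs constraint location rsc2 prefers host1=10
  pcs property set stonith-enabled=true
  pcs property set no-quorum-policy=ignore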

One thing that pcs is not showing, which might be relevant here, is that I
have a resource-stickiness value of 1000 set to prevent resources from
failing back to nodes after a failover.
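
(The pcs way of setting such a default, for reference, would be something like:

  pcs resource defaults resource-stickiness=1000

though it could equally be set as a per-resource meta attribute.)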

With the above configuration, if host1 is shut down, host2 just spins in
a loop doing:

Dec  2 20:00:02 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster
Dec  2 20:00:02 host2 pengine[8923]:  warning: determine_online_status: Node host1 is unclean
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)
Dec  2 20:00:02 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 for STONITH
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move    st-fencing#011(Started host1 -> host2)
Dec  2 20:00:02 host2 pengine[8923]:   notice: LogActions: Move    rsc1#011(Started host1 -> host2)
Dec  2 20:00:02 host2 crmd[8924]:   notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:00:02 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: ad69ead5-0bbb-45d8-bb07-30bcd405ace2 (0)
Dec  2 20:00:02 host2 pengine[8923]:  warning: process_pe_message: Calculated Transition 22: /var/lib/pacemaker/pengine/pe-warn-2.bz2  
Dec  2 20:01:14 host2 stonith-ng[8920]:    error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924 at host2.ad69ead5: Timer expired
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith operation 4/13:22:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith operation 4 for host1 failed (Timer expired): aborting transition.
Dec  2 20:01:14 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=ad69ead5-0bbb-45d8-bb07-30bcd405ace2) by client crmd.8924
Dec  2 20:01:14 host2 crmd[8924]:   notice: run_graph: Transition 22 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
Dec  2 20:01:14 host2 pengine[8923]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Dec  2 20:01:14 host2 pengine[8923]:  warning: pe_fence_node: Node host1 will be fenced because the node is no longer part of the cluster  
Dec  2 20:01:14 host2 pengine[8923]:  warning: determine_online_status: Node host1 is unclean
Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action st-fencing_stop_0 on host1 is unrunnable (offline)
Dec  2 20:01:14 host2 pengine[8923]:  warning: custom_action: Action rsc1_stop_0 on host1 is unrunnable (offline)  
Dec  2 20:01:14 host2 pengine[8923]:  warning: stage6: Scheduling Node host1 for STONITH
Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move    st-fencing#011(Started host1 -> host2)
Dec  2 20:01:14 host2 pengine[8923]:   notice: LogActions: Move    rsc1#011(Started host1 -> host2)
Dec  2 20:01:14 host2 pengine[8923]:  warning: process_pe_message: Calculated Transition 23: /var/lib/pacemaker/pengine/pe-warn-2.bz2
Dec  2 20:01:14 host2 crmd[8924]:   notice: te_fence_node: Executing reboot fencing operation (13) on host1 (timeout=60000)  
Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: handle_request: Client crmd.8924.39504cd3 wants to fence (reboot) 'host1' with device '(any)'
Dec  2 20:01:14 host2 stonith-ng[8920]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for host1: 4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6 (0)
Dec  2 20:02:26 host2 stonith-ng[8920]:    error: remote_op_done: Operation reboot of host1 by host2 for crmd.8924 at host2.4c3f947b: Timer expired
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith operation 5/13:23:0:0171e376-182e-485f-a484-9e638e1bd355: Timer expired (-62)
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_callback: Stonith operation 5 for host1 failed (Timer expired): aborting transition.  
Dec  2 20:02:26 host2 crmd[8924]:   notice: tengine_stonith_notify: Peer host1 was not terminated (reboot) by host2 for host2: Timer expired (ref=4c3f947b-12a7-4b6f-84a9-c5ddcbeb55c6) by client crmd.8924  
Dec  2 20:02:26 host2 crmd[8924]:   notice: run_graph: Transition 23 (Complete=1, Pending=0, Fired=0, Skipped=7, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-2.bz2): Stopped
[... the same cycle then repeats for transitions 24, 25, 26 and onward:
pengine schedules host1 for STONITH, stonith-ng on host2 initiates the
reboot, and the operation fails 72 seconds later with "Timer expired" ...]

So is there something new about 1.1.10 that I am missing?
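
If it helps with diagnosis, I can run things like the following on host2 and
post the output (assuming stonith_admin in 1.1.10 still takes these options):

  # devices stonith-ng on host2 has registered
  stonith_admin --list-registered
  # devices stonith-ng believes can fence host1
  stonith_admin --list host1
  # attempt a manual reboot of host1 from host2
  stonith_admin --reboot host1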

Cheers,
b.

