[Pacemaker] catch-22: can't fence node A because node A has the fencing resource

Wed Jan 8 22:17:33 UTC 2014

On 4 Dec 2013, at 11:47 am, Brian J. Murrell <brian at interlinx.bc.ca> wrote:

> 
> On Tue, 2013-12-03 at 18:26 -0500, David Vossel wrote: 
>> 
>> We did away with all of the policy engine logic involved with trying to move fencing devices off of the target node before executing the fencing action. Behind the scenes all fencing devices are now essentially clones.  If the target node to be fenced has a fencing device running on it, that device can execute anywhere in the cluster to avoid the "suicide" situation.
> 
> OK.
> 
>> When you are looking at crm_mon output and see a fencing device is running on a specific node, all that really means is that we are going to attempt to execute fencing actions for that device from that node first. If that node is unavailable,
> 
> Would it be better never to ask a node to commit suicide at all, and
> always try to use another node instead?

IIRC the only time we ask a node to fence itself is when it is (or thinks it is) the last node standing.

> 
>> we'll try that same device anywhere in the cluster we can get it to work
> 
> OK.
> 
>> (unless you've specifically built some location constraint that prevents the fencing device from ever running on a specific node)
> 
> While I do have constraints on the more service-oriented resources to
> give them preferred nodes, I don't have any constraints on the fencing
> resources.
> 
> So given all of the above, and given the log I supplied showing that
> fencing was not being attempted anywhere other than on the node to be
> fenced (which was down during that log), any clues as to where to look
> for the reason?
> 
>> Hope that helps.
> 
> It explains the differences, but unfortunately I'm still not sure why it
> wouldn't get run somewhere else, eventually, rather than continually
> being attempted on the node to be killed (which as I mentioned, was shut
> down at the time the log was made).

Yes, this is surprising.
Can you enable the blackbox for stonith-ng, reproduce the problem, and generate a crm_report for us, please?  It will contain all the information we need.
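
For reference, the behaviour David and I described above can be written down as a rough sketch. This is only an illustrative Python model of the selection order as stated in this thread, not Pacemaker's actual code; the function name pick_fencing_executor, its parameters, and the node names are all made up for the example. It assumes a "banned" set stands in for location constraints that keep the fencing device off particular nodes.

```python
def pick_fencing_executor(target, preferred, online, banned=frozenset()):
    """Toy model: choose which node runs the fencing action against `target`.

    target    -- node to be fenced
    preferred -- node where crm_mon shows the fencing device "running"
    online    -- list of nodes currently available
    banned    -- nodes a location constraint keeps the device off (assumption)
    Returns a node name, or None if nobody can run the device.
    """
    # Any available, non-banned node other than the target is a candidate.
    others = [n for n in online if n != target and n not in banned]

    # Try the node the device is nominally running on first...
    if preferred in others:
        return preferred
    # ...otherwise use the device from anywhere else it can work.
    if others:
        return others[0]

    # Self-fencing is only considered when the target is the last node standing.
    if list(online) == [target] and target not in banned:
        return target
    return None


# Brian's case: the device "runs" on nodeA, nodeA is the target and is down.
print(pick_fencing_executor("nodeA", "nodeA", ["nodeB", "nodeC"]))
```

Under this model, the case in Brian's log (target down, other nodes up, no constraints on the fencing resource) should end up on one of the surviving nodes, which is why the observed behaviour is surprising.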