[Pacemaker] node1 fencing itself after node2 being fenced

Asgaroth lists at blueface.com
Tue Feb 18 05:12:16 EST 2014


> 
> The 3rd node should (and needs to be) fenced at this point to allow the
> cluster to continue.
> Is this not happening?

The fencing operation appears to complete successfully. Here is the
sequence:

[1] All 3 nodes running properly
[2] On node 3 I run "echo c > /proc/sysrq-trigger" which "hangs" node3
[3] The fence_test03 resource executes a fence operation on node 3 (fires a
shutdown/startup on the vm)
[4] dlm shows kern_stop state while node 3 is being fenced
[5] node 3 reboots, and nodes 1 & 2 operate as normal (clvmd and gfs2 work
properly, and dlm is notified that the fence was successful (2 members in
each lock group))
[6] While node 3 is booting, cman starts properly then clvmd starts but
hangs on boot
[7] While node 3 is "hung" at the clvmd stage, nodes 1 & 2 are unable to
perform lvm operations because node 3 is attempting to join the clvmd "group".
Dlm shows that node 3 is a member and cman sees node 3 as a cluster member,
but pacemaker has not started on node 3 because clvmd never finishes starting.
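
For reference, the states described in the sequence above can be inspected
on a surviving node with commands along these lines. This is only a sketch:
the "show" wrapper is a hypothetical helper (not part of any cluster tool)
that skips tools that are not installed, and it assumes a cman-based stack
where cman_tool, dlm_tool, fence_tool and crm_mon are available.

```shell
# Hypothetical helper: run a status command only if the tool is installed.
show() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@"
    else
        echo "== $1 not installed, skipping"
    fi
}

show cman_tool nodes   # cluster membership as cman sees it
show dlm_tool ls       # lockspaces, their members, and flags such as kern_stop
show fence_tool ls     # fence domain members and pending fence victims
show crm_mon -1        # one-shot pacemaker status (nodes and resources)
```

Comparing the member lists from cman_tool and dlm_tool against the crm_mon
output should show the mismatch in step [7]: node 3 present at the cman/dlm
level but missing from pacemaker.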

Because pacemaker is not "up" on node 3, and because I do not have clvmd
defined as a resource, no fence is performed if/when clvmd fails.

Other than the above, fencing appears to be working properly. Are there any
other fencing tests you would like me to perform to verify that fencing is
working as expected?

> 
> Did you specify on-fail=fence for the clvmd agent?
> 


Hmmm, I don't have any clvmd agents defined within pacemaker at the moment
as I am starting clvmd outside of pacemaker control.

In my original post I had clvmd and dlm defined as clone resources under
pacemaker control. My understanding from the responses to that post was that
I should remove those resources from pacemaker control and start clvmd on
boot, with dlm managed by the cman startup. Are you saying that I should have
dlm/clvmd defined as pacemaker resources and still have clvmd start on
bootup?

For example, originally I defined dlm/clvmd under pacemaker control as
follows:

pcs resource create dlm ocf:pacemaker:controld \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true
pcs resource create clvmd lsb:clvmd \
    op monitor interval=30s on-fail=fence \
    clone interleave=true ordered=true

However, right now, the above two resource definitions have been removed
from pacemaker.

Thanks for your time (and others too) thus far in assisting me with this
issue.

Thanks




