[Pacemaker] question about stonith:external/libvirt
Matthew O'Connor
matt at ecsorl.com
Sun May 20 04:40:20 UTC 2012
After following the tutorial on the Hastexo site for setting up stonith
via libvirt, I believe I have it working correctly, but some strange
things are happening. I have two nodes, with shared storage provided by
a dual-primary DRBD resource and OCFS2. Here is one of my stonith
primitives:
primitive p_fence-l2 stonith:external/libvirt \
    params hostlist="l2:l2.sandbox" \
        hypervisor_uri="qemu+ssh://matt@hv01/system" stonith-timeout="30" \
        pcmk_host_check="none" \
    op start interval="0" timeout="15" \
    op stop interval="0" timeout="15" \
    op monitor interval="60" \
    meta target-role="Started"
This cluster has stonith-enabled="true" in the cluster options, plus the
necessary location statements in the CIB.
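(For illustration only: location statements for a fencing resource usually
keep it off the node it is meant to fence. A minimal sketch of that form,
where the constraint ID l_fence-l2 is a made-up name and the exact
statements in this cluster are not shown in the post:)
# hypothetical constraint ID; never run p_fence-l2 on l2 itself
location l_fence-l2 p_fence-l2 -inf: l2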
To watch the DLM, I run dbench on the shared storage on the node I let
live. While it's running, I creatively nuke the other node. If I just
"killall pacemakerd" on l2, for instance, the DLM seems unaffected and
the fence takes place, rebooting the now "failed" node l2. There is no
real interruption of service on the surviving node, l3. Yet if I run
"halt -f -n" on l2, the fence still takes place, but the DLM on the
surviving node (l3) hangs and won't come back until I bring the failed
node back online. l2 and l3 can be interchanged - the results are the
same. When the DLM is hung as in the latter case, kernel messages about
hung tasks eventually start populating the syslog.
I thought I had recently read some posts concerning this very topic, but
for the life of me I can't find them...
Any ideas on how I should proceed, or what I should look for next?
Thanks!
-- Matt