[Pacemaker] question about stonith:external/libvirt

Sun May 20 04:40:20 UTC 2012

After following the tutorial on the Hastexo site for setting up stonith via 
libvirt, I believe I have it working correctly, but some strange 
things are happening.  I have two nodes, with shared storage provided by 
a dual-primary DRBD resource and OCFS2.  Here is one of my stonith 
primitives:

primitive p_fence-l2 stonith:external/libvirt \
         params hostlist="l2:l2.sandbox" \
                hypervisor_uri="qemu+ssh://matt@hv01/system" \
                stonith-timeout="30" pcmk_host_check="none" \
         op start interval="0" timeout="15" \
         op stop interval="0" timeout="15" \
         op monitor interval="60" \
         meta target-role="Started"

This cluster has stonith-enabled="true" in the cluster options, plus the 
necessary location statements in the CIB.
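
(For illustration only: a minimal sketch of what those two pieces 
typically look like in crm configure syntax.  It assumes the usual 
arrangement where each fencing primitive is kept off the node it is 
meant to fence.  The constraint names l_fence-l2 and l_fence-l3 are 
hypothetical, and p_fence-l3 is assumed to be the counterpart of the 
primitive shown above; none of this is copied from the actual CIB.)

# enable fencing cluster-wide (stated above)
property stonith-enabled="true"
# assumed: never run a fencing resource on the node it fences
location l_fence-l2 p_fence-l2 -inf: l2
location l_fence-l3 p_fence-l3 -inf: l3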

To watch the DLM, I run dbench on the shared storage on the node I let 
live.  While it's running, I creatively nuke the other node.  If I just 
"killall pacemakerd" on l2, for instance, the DLM seems unaffected and 
the fence takes place, rebooting the now "failed" node l2, with no real 
interruption of service on the surviving node, l3.  Yet if I "halt -f 
-n" on l2, the fence still takes place, but the DLM on the surviving 
node (l3) hangs and doesn't come back until I bring the failed node 
back online.  l2 and l3 can be interchanged; the results are the 
same.  When the DLM is hung like this, kernel messages about hung 
tasks eventually start populating the syslog.

I thought I had recently read some posts concerning this very topic, but 
for the life of me I can't find them...
Any ideas on how I should proceed, or what I should look for next?

Thanks!
-- Matt