<div dir="ltr"><div>I should mention that in my case stonithing does work occasionally when the SBD resource is defined on one node only, but not often, and I can&#39;t find a pattern in when it works or fails. What puzzles me are the following lines in the log file:</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster</div><div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: determine_online_status: Node slesha1n2i-u is unclean</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)</div><div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling Node slesha1n2i-u for STONITH</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move    stonith_sbd   (Started slesha1n2i-u -&gt; slesha1n1i-u)</div><div> ...</div><div> Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: te_fence_node: Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)</div>

<div> Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice: handle_request: Client crmd.8916.3144546f wants to fence (reboot) &#39;slesha1n2i-u&#39; with device &#39;(any)&#39;</div><div> Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)</div>

<div> Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:    error: remote_op_done: Operation reboot of slesha1n2i-u by slesha1n1i-u for crmd.8916@slesha1n1i-u.8c00ff7b: No route to host</div><div> Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: tengine_stonith_callback: Stonith operation 3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)</div>

<div> Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No route to host): aborting transition.</div><div> Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: tengine_stonith_notify: Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b) by client crmd.8916</div>

<div> Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: run_graph: Transition 3 (Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped</div><div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: unpack_config: On loss of CCM Quorum: Ignore</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node slesha1n2i-u will be fenced because the node is no longer part of the cluster</div><div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: determine_online_status: Node slesha1n2i-u is unclean</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action: Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)</div><div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling Node slesha1n2i-u for STONITH</div>

<div> Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move    stonith_sbd   (Started slesha1n2i-u -&gt; slesha1n1i-u)</div><div> ...</div><div> Aug  1 12:00:02 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures: Too many failures to fence slesha1n2i-u (11), giving up</div>

<div> </div><div> </div><div>What does this mean?</div><div>tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No route to host): aborting transition.</div><div><br></div><div>Of course there is no route to the other host, since the network interface is down on that node. But the SBD stonith operation shouldn&#39;t depend on the network connection at all, should it?</div>

<div><br></div><div><br></div><div>I have also tested another setup where I define the SBD resource on both nodes (which, as I understand it, is not recommended). In this case stonithing always works, so SBD messaging itself must be working as it should. I also tested fencing the other node manually with the sbd command, and that always works too. So I&#39;m still confused about why SBD stonithing fails when the resource is defined on one node only.</div>

<div><br></div><div><br></div><div>Regards</div><div>Jan C</div><div class="gmail_extra"><br><br><div class="gmail_quote">On Wed, Aug 21, 2013 at 4:27 PM, Lars Marowsky-Bree <span dir="ltr">&lt;<a href="mailto:lmb@suse.com" target="_blank">lmb@suse.com</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 2013-08-20T08:52:00, &quot;Angel L. Mateo&quot; &lt;<a href="mailto:amateo@um.es">amateo@um.es</a>&gt; wrote:<br>

<br>

Sorry, I was on vacation for a few weeks, thus only chiming in now.<br>

<br>

Instead of the Linux-HA Wiki page, please look here for the<br>

documentation: <a href="https://github.com/l-mb/sbd/blob/master/man/sbd.8.pod" target="_blank">https://github.com/l-mb/sbd/blob/master/man/sbd.8.pod</a><br>

<br>

(Or, on a system with sbd installed, simply type &quot;man sbd&quot;)<br>

<br>

The most common causes of fencing failures with SBD:<br>

<br>

- Pacemaker&#39;s stonith-timeout is not long enough to account for sbd&#39;s<br>

  msgwait. It needs to be at least 50% larger. (Pacemaker uses some of<br>

  the stonith-timeout for the look-up phase, and it isn&#39;t available for<br>

  the actual fence request.)<br>

<br>

- The storage is not truly shared.<br>

<br>

  Then the node can&#39;t actually &quot;see&quot; the other, and will not be able to<br>

  find the messaging slot. Hence, fencing will fail.<br>
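<br>
As a rough illustration of the first point above (this is not part of sbd or Pacemaker), the 50% rule can be written as a small Python helper. The function names are made up for this sketch, and it assumes both values are given in whole seconds:<br>

```python
def min_stonith_timeout(msgwait_s):
    """Smallest whole-second stonith-timeout that is 50% larger than msgwait.

    Hypothetical helper: it only encodes the rule of thumb quoted above,
    using integer arithmetic to round 1.5 * msgwait up.
    """
    return (3 * msgwait_s + 1) // 2


def max_msgwait(stonith_timeout_s):
    """Largest whole-second msgwait that still fits a given stonith-timeout."""
    return (2 * stonith_timeout_s) // 3


if __name__ == "__main__":
    # Example values only; read the real msgwait from the SBD device header.
    print(min_stonith_timeout(10))
    print(max_msgwait(60))
```

If the timeout=60000 shown in the log excerpt at the top of this message corresponds to stonith-timeout (an assumption), the second helper gives 40 seconds as the largest msgwait that still satisfies the rule.<br>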

<div class="im"><br>

&gt;       To get it working (Ubuntu 12.04) I had to create an /etc/sysconfig/sbd file with:<br>

&gt;<br>

&gt; SBD_DEVICE=&quot;/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1&quot;<br>

&gt; SBD_OPTS=&quot;-W&quot;<br>

&gt;<br>

&gt;       and the resource configuration is<br>

&gt;<br>

&gt; primitive stonith_sbd stonith:external/sbd \<br>

&gt;         params sbd_device=&quot;/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1&quot; \<br>

&gt;         meta target-role=&quot;Started&quot;<br>

<br>

</div>In the newer versions, it is not necessary to have the &quot;params&quot; on the<br>

primitive anymore - it&#39;ll read the /etc/sysconfig/sbd file. Overriding<br>

that shouldn&#39;t really be necessary.<br>

<br>

I can assure you that sbd fencing is working fine in SLE HA 11 SP3, or<br>

my lab cluster would never complete a single fence successfully ;-)<br>

<br>

<br>

Regards,<br>

    Lars<br>

<span class="HOEnZb"><font color="#888888"><br>

--<br>

Architect Storage/HA<br>

SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)<br>

&quot;Experience is the name everyone gives to their mistakes.&quot; -- Oscar Wilde<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

<br>


</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br>mvh<div>Jan Christian</div>

</div></div>