<div dir="ltr"><div><div><div><div><div><div>Hi, Digimer:<br>Thanks for the detailed explanation.<br>I followed the guide in the ClusterLabs documentation and configured the following IPMI-based stonith resources for my DRBD-related service:<br>

primitive suse2-stonith stonith:external/ipmi \<br>        params hostname=&quot;suse2&quot; ipaddr=&quot;XXX&quot; userid=&quot;admin&quot; passwd=&quot;xxx&quot; interface=&quot;lan&quot;<br>primitive suse4-stonith stonith:external/ipmi \<br>

        params hostname=&quot;suse4&quot; ipaddr=&quot;YYY&quot; userid=&quot;admin&quot; passwd=&quot;yyy&quot; interface=&quot;lan&quot;<br>location st-suse2 suse2-stonith -inf: suse2<br>location st-suse4 suse4-stonith -inf: suse4<br>

<br></div>After enabling the IPMI device and channel authentication, I used the command below to cut the network link of the DRBD primary machine:<br></div>iptables -A INPUT -j DROP<br></div>After about 1 second, I could see that pacemaker had the secondary DRBD machine send an IPMI reset command to power cycle the primary machine. However, the problem is that the resource stays &quot;Stopped&quot; and never fails over to the secondary machine.<br>

</div>&quot;crm status&quot; shows that the primary machine is in the &quot;OFFLINE&quot; state, and none of the resources are started on the secondary machine, where they are supposed to run.<br></div>Is this because the failed primary node was fenced, and that blocks pacemaker from scheduling the resources on the secondary machine?<br>

</div>Your hints are really appreciated.<br>Thanks.<br></div><div class="gmail_extra"><br><br><div class="gmail_quote">On Thu, Aug 29, 2013 at 1:55 AM, Digimer <span dir="ltr">&lt;<a href="mailto:lists@alteeve.ca" target="_blank">lists@alteeve.ca</a>&gt;</span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On 28/08/13 13:13, Xiaomin Zhang wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi, Gurus:<br>

I have a simple master-slave setup for mirrored DRBD storage. A Java<br>

application server daemon writes transaction data to this<br>

storage.<br>

node Lhs072gkz \<br>

         attributes standby=&quot;on&quot;<br>

node Lpplj9jb4<br>

node Lvoim0kaw<br>

primitive drbd1 ocf:linbit:drbd \<br>

         params drbd_resource=&quot;r0&quot; \<br>

         op monitor interval=&quot;15s&quot;<br>

ms ms_drbd1 drbd1 \<br>

         meta master-max=&quot;1&quot; master-node-max=&quot;1&quot; clone-max=&quot;2&quot; clone-node-max=&quot;1&quot; notify=&quot;true&quot; target-role=&quot;Started&quot;<br>

location drbd-fence-by-handler-ms_drbd1 ms_drbd1 \<br>

         rule $id=&quot;drbd-fence-by-handler-rule-ms_drbd1&quot; $role=&quot;Master&quot; -inf: #uname ne Lpplj9jb4<br>

<br>

It seems a split-brain is very likely to happen when I reboot the slave<br>

machine, even when the Java application is writing nothing to the DRBD<br>

storage.<br>

Is this expected behavior?<br>

<br>

I also found some discussions about automatically recovering from a split-brain in<br>

DRBD. They just say that once some settings are added to the DRBD configuration,<br>

everything should work. Is this good practice?<br>

Thanks.<br>

</blockquote>

<br></div></div>

No, split-brains are not at all expected behaviour, but they happen when things are not set up properly.<br>

<br>

The best thing to do is to avoid a split-brain in the first place, which is easy to do if you set up (working) stonith/fencing.<br>

<br>

If you configure stonith in pacemaker using IPMI (the most common method) and test it to make sure nodes reboot on failure, you can then &quot;hook&quot; drbd into pacemaker&#39;s fencing. You do this by setting the fence policy to &quot;resource-and-stonith&quot; and then telling DRBD to use the &quot;crm-fence-peer.sh&quot; fence handler.<br>


<br>

This tells DRBD to block IO and call the fence handler if the peer fails (or vanishes). The fence handler then calls pacemaker and says &quot;please fence node X&quot;. When pacemaker succeeds, it tells the handler, which in turn tells DRBD that it&#39;s now safe to resume IO. One of the nodes will be dead, so you avoid the split-brain in the first place.<br>
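<br>
A minimal sketch of the DRBD side of this, assuming DRBD 8.4 syntax and the resource name &quot;r0&quot; from the quoted configuration. The handler paths are an assumption and vary by distribution. The unfence handler is not mentioned above, but it is normally paired with the fence handler so that the constraint is removed again after resync:<br>
<pre>resource r0 {
  disk {
    # Block IO and call the fence-peer handler when the peer is lost
    fencing resource-and-stonith;
  }
  handlers {
    # Assumed install path; check where your DRBD package puts these scripts
    fence-peer &quot;/usr/lib/drbd/crm-fence-peer.sh&quot;;
    # Removes the constraint added by the fence handler once resync completes
    after-resync-target &quot;/usr/lib/drbd/crm-unfence-peer.sh&quot;;
  }
}</pre>
The &quot;drbd-fence-by-handler-ms_drbd1&quot; location constraint in the quoted configuration is the kind of constraint this fence handler adds.<br>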


<br>

If your servers have IPMI, iLO, iDRAC, RSA, etc, you can use the &#39;fence_ipmilan&#39; fence agent in your pacemaker configuration. If you need help with this, just say.<br>
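<br>
As a rough sketch in crm shell syntax, assuming a node named &quot;node1&quot; and placeholder BMC address and credentials (all hypothetical). Parameter names differ between fence-agents versions, so check the agent&#39;s metadata first. The location constraint keeps the fence device off the node it is meant to fence:<br>
<pre>primitive fence-node1 stonith:fence_ipmilan \
        params pcmk_host_list=&quot;node1&quot; ipaddr=&quot;BMC_ADDRESS&quot; login=&quot;admin&quot; passwd=&quot;SECRET&quot; \
        op monitor interval=&quot;60s&quot;
location loc-fence-node1 fence-node1 -inf: node1</pre>
You can then test it from another node with &quot;stonith_admin --reboot node1&quot; and confirm that the node really power cycles.<br>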

<br>

Cheers<span class="HOEnZb"><font color="#888888"><br>

<br>

digimer<br>

<br>

-- <br>

Digimer<br>

Papers and Projects: <a href="https://alteeve.ca/w/" target="_blank">https://alteeve.ca/w/</a><br>

What if the cure for cancer is trapped in the mind of a person without access to education?<br>

</font></span></blockquote></div><br></div>