Thanks for your reply, Andreas.<br><br>My first node is a virtual machine (the active node); the second (passive) node is a physical standalone server. There is no high load on either of them, but the problem seems to come from the virtual server.<br>

I actually hit the same split-brain problem when I take or delete a virtual machine snapshot (the network connection is lost for a brief moment, maybe about 1s). But I only take a snapshot once a week, and I get split brain several times a week. <br>

I didn&#39;t detect any other loss of connection, or perhaps there are micro network cuts that my monitoring system doesn&#39;t catch (and I have no problem with my non-clustered services).<br>

If micro-cuts are the cause, I think the problem lies with DRBD: is it too sensitive? Can I adjust values to avoid the problem?<br>
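For reference, DRBD&#39;s tolerance to short network interruptions is governed by a few options in the <code>net</code> section. The values below are the stock defaults, shown only as a starting point to tune upward; please verify them against the drbd.conf man page for your DRBD version:

```
net {
    # timeout is in tenths of a second: 60 = 6s without a reply
    # before the peer is declared dead and the connection dropped
    timeout      60;
    # seconds between keep-alive packets on an otherwise idle link
    ping-int     10;
    # tenths of a second to wait for a ping answer: 5 = 500ms
    ping-timeout 5;
    # seconds between reconnection attempts after a drop
    connect-int  10;
}
```

Raising <code>ping-timeout</code> and <code>timeout</code> somewhat should let DRBD ride out a ~1s cut without dropping the connection.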

<br><br>I will try increasing my token value to 10000 and my consensus to 12000, and I will configure resource-level fencing in DRBD; thanks for the tips.<br>
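For anyone following along, the two changes together would look roughly like this (a sketch using the values mentioned above; both corosync values are in milliseconds and <code>consensus</code> must stay larger than <code>token</code>):

```
# /etc/corosync/corosync.conf
totem {
    version: 2
    # time (ms) to wait for a token before declaring a node lost
    token: 10000
    # time (ms) to wait for consensus before starting a new membership round
    consensus: 12000
}
```

And in drbd.conf, resource-level fencing is enabled with a <code>fencing</code> option in the <code>disk</code> section, which activates the <code>fence-peer</code> handler already present in my configuration:

```
resource drbd-mysql {
    disk {
        on-io-error detach;
        # block promotion of the peer while it is unreachable
        fencing resource-only;
    }
    handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
}
```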

<br>About redundant rings: I read in the DRBD documentation that they are vital for resource-level fencing, but can I do without them?<br>Because I use a virtual server (my virtual servers are on a blade), I can&#39;t have a &quot;physical&quot; link between the two nodes (a cable directly between them), so I use &quot;virtual&quot; links (with VLANs to separate them from my main network). I can create a second corosync ring, but I doubt its usefulness: if something goes wrong on the first link, I think I would have the same problem on the second. Although they are virtually separated, they use the same physical hardware (all my hardware is redundant, so link problems are very limited). <br>
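In case it is still useful, a second ring over the existing DRBD VLAN would look something like this in corosync.conf (a sketch; the <code>bindnetaddr</code> values come from the networks described below, while the multicast addresses and ports are placeholders to adapt):

```
totem {
    # use the second ring only when the first one fails
    rrp_mode: passive
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.3.0     # dedicated corosync VLAN
        mcastaddr: 226.94.1.1        # placeholder multicast group
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.2.0     # reuse the DRBD VLAN as backup ring
        mcastaddr: 226.94.1.2        # placeholder, must differ from ring 0
        mcastport: 5407
    }
}
```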

But maybe I&#39;m wrong; I&#39;ll think about it.<br><br><br>About STONITH, I will read the documentation, but is it really useful to bring out the &quot;big artillery&quot; for a simple 2-node cluster in active/passive mode? (I read that STONITH is mostly used for active/active clusters.)<br>
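If it helps the discussion: for the physical node, a STONITH resource could be a simple IPMI-based agent in crm shell syntax (a sketch; the hostname, address, and credentials are hypothetical, and the virtual node would instead need a hypervisor-based agent such as one driving vCenter):

```
# fence the physical node via its IPMI/BMC interface
primitive stonith-node2 stonith:external/ipmi \
        params hostname="node2" ipaddr="10.1.0.201" \
               userid="admin" passwd="secret" \
        op monitor interval="60s"
# a node must never be responsible for shooting itself
location stonith-node2-loc stonith-node2 -inf: node2
property stonith-enabled="true"
```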

<br>Anyway, thank you for this advice, it is much appreciated!<br>

<br><br><div class="gmail_quote">2012/6/26 Andreas Kurz <span dir="ltr">&lt;<a href="mailto:andreas@hastexo.com" target="_blank">andreas@hastexo.com</a>&gt;</span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div class="im">On 06/26/2012 03:49 PM, coma wrote:<br>
&gt; Hello,<br>
&gt;<br>
&gt; i running on a 2 node cluster with corosync &amp; drbd in active/passive<br>
&gt; mode for mysql hight availablity.<br>
&gt;<br>
&gt; The cluster working fine (failover/failback &amp; replication ok), i have no<br>
&gt; network outage (network is monitored and i&#39;ve not seen any failure) but<br>
&gt; split-brain occurs very often and i don&#39;t anderstand why, maybe you can<br>
&gt; help me?<br>
<br>
</div>Are the nodes virtual machines or have a high load from time to time?<br>
<div class="im"><br>
&gt;<br>
&gt; I&#39;m new pacemaker/corosync/DRBD user, so my cluster and drbd<br>
&gt; configuration are probably not optimal, so if you have any comments,<br>
&gt; tips or examples I would be very grateful!<br>
&gt;<br>
&gt; Here is an exemple of corosync log when a split-brain occurs (1 hour log<br>
&gt; to see before/after split-brain):<br>
&gt;<br>
&gt; <a href="http://pastebin.com/3DprkcTA" target="_blank">http://pastebin.com/3DprkcTA</a><br>
<br>
</div>Increase your token value in corosync.conf to a higher value ... like<br>
10s, configure resource-level fencing in DRBD and setup STONITH for your<br>
cluster and use redundant corosync rings.<br>
<br>
Regards,<br>
Andreas<br>
<br>
--<br>
Need help with Pacemaker?<br>
<a href="http://www.hastexo.com/now" target="_blank">http://www.hastexo.com/now</a><br>
<div class="im"><br>
&gt;<br>
&gt; Thank you in advance for any help!<br>
&gt;<br>
&gt;<br>
&gt; More details about my configuration:<br>
&gt;<br>
&gt; I have:<br>
&gt; One prefered &quot;master&quot; node (node1) on a virtual server, and one &quot;slave&quot;<br>
&gt; node on a physical server.<br>
&gt; On each server,<br>
&gt; eth0 is connected on my main LAN for client/server communication (with<br>
&gt; cluster VIP)<br>
&gt; Eth1 is connected on a dedicated Vlan for corosync communication<br>
&gt; (network: 192.168.3.0 /30)<br>
&gt; Eth2 is connected on a dedicated Vlan for drbd replication (network:<br>
</div>&gt; <a href="http://192.168.2.0/30" target="_blank">192.168.2.0/30</a> &lt;<a href="http://192.168.2.0/30" target="_blank">http://192.168.2.0/30</a>&gt;)<br>
<div class="im">&gt;<br>
&gt; Here is my drbd configuration:<br>
&gt;<br>
&gt;<br>
&gt; resource drbd-mysql {<br>
&gt; protocol C;<br>
&gt;     disk {<br>
&gt;         on-io-error detach;<br>
&gt;     }<br>
&gt;     handlers {<br>
&gt;         fence-peer &quot;/usr/lib/drbd/crm-fence-peer.sh&quot;;<br>
&gt;         after-resync-target &quot;/usr/lib/drbd/crm-unfence-peer.sh&quot;;<br>
&gt;         split-brain &quot;/usr/lib/drbd/notify-split-brain.sh root&quot;;<br>
&gt;     }<br>
&gt;     net {<br>
&gt;         cram-hmac-alg sha1;<br>
&gt;         shared-secret &quot;secret&quot;;<br>
&gt;         after-sb-0pri discard-younger-primary;<br>
&gt;         after-sb-1pri discard-secondary;<br>
&gt;         after-sb-2pri call-pri-lost-after-sb;<br>
&gt;     }<br>
&gt;     startup {<br>
&gt;         wfc-timeout  1;<br>
&gt;         degr-wfc-timeout 1;<br>
&gt;     }<br>
&gt;     on node1{<br>
&gt;         device /dev/drbd1;<br>
</div>&gt;         address <a href="http://192.168.2.1:7801" target="_blank">192.168.2.1:7801</a> &lt;<a href="http://192.168.2.1:7801" target="_blank">http://192.168.2.1:7801</a>&gt;;<br>
<div class="im">&gt;         disk /dev/sdb;<br>
&gt;         meta-disk internal;<br>
&gt;     }<br>
&gt;     on node2 {<br>
&gt;     device /dev/drbd1;<br>
</div>&gt;     address <a href="http://192.168.2.2:7801" target="_blank">192.168.2.2:7801</a> &lt;<a href="http://192.168.2.2:7801" target="_blank">http://192.168.2.2:7801</a>&gt;;<br>
<div><div class="h5">&gt;     disk /dev/sdb;<br>
&gt;     meta-disk internal;<br>
&gt;     }<br>
&gt; }<br>
&gt;<br>
&gt;<br>
&gt; Here my cluster config:<br>
&gt;<br>
&gt; node node1 \<br>
&gt;         attributes standby=&quot;off&quot;<br>
&gt; node node2 \<br>
&gt;         attributes standby=&quot;off&quot;<br>
&gt; primitive Cluster-VIP ocf:heartbeat:IPaddr2 \<br>
&gt;         params ip=&quot;10.1.0.130&quot; broadcast=&quot;10.1.7.255&quot; nic=&quot;eth0&quot;<br>
&gt; cidr_netmask=&quot;21&quot; iflabel=&quot;VIP1&quot; \<br>
&gt;         op monitor interval=&quot;10s&quot; timeout=&quot;20s&quot; \<br>
&gt;         meta is-managed=&quot;true&quot;<br>
&gt; primitive cluster_status_page ocf:heartbeat:ClusterMon \<br>
&gt;         params pidfile=&quot;/var/run/crm_mon.pid&quot;<br>
&gt; htmlfile=&quot;/var/www/html/cluster_status.html&quot; \<br>
&gt;         op monitor interval=&quot;4s&quot; timeout=&quot;20s&quot;<br>
&gt; primitive datavg ocf:heartbeat:LVM \<br>
&gt;         params volgrpname=&quot;datavg&quot; exclusive=&quot;true&quot; \<br>
&gt;         op start interval=&quot;0&quot; timeout=&quot;30&quot; \<br>
&gt;         op stop interval=&quot;0&quot; timeout=&quot;30&quot;<br>
&gt; primitive drbd_mysql ocf:linbit:drbd \<br>
&gt;         params drbd_resource=&quot;drbd-mysql&quot; \<br>
&gt;         op monitor interval=&quot;15s&quot;<br>
&gt; primitive fs_mysql ocf:heartbeat:Filesystem \<br>
&gt;         params device=&quot;/dev/datavg/data&quot; directory=&quot;/data&quot; fstype=&quot;ext4&quot;<br>
&gt; primitive mail_alert ocf:heartbeat:MailTo \<br>
</div></div>&gt;         params email=&quot;<a href="mailto:myemail@test.com">myemail@test.com</a> &lt;mailto:<a href="mailto:myemail@test.com">myemail@test.com</a>&gt;&quot; \<br>
<div class="HOEnZb"><div class="h5">&gt;         op monitor interval=&quot;10&quot; timeout=&quot;10&quot; depth=&quot;0&quot;<br>
&gt; primitive mysqld ocf:heartbeat:mysql \<br>
&gt;         params binary=&quot;/usr/bin/mysqld_safe&quot; config=&quot;/etc/my.cnf&quot;<br>
&gt; datadir=&quot;/data/mysql/databases&quot; user=&quot;mysql&quot;<br>
&gt; pid=&quot;/var/run/mysqld/mysqld.pid&quot; socket=&quot;/var/lib/mysql/mysql.sock&quot;<br>
&gt; test_passwd=&quot;cluster_test&quot; test_table=&quot;Cluster_Test.dbcheck&quot;<br>
&gt; test_user=&quot;cluster_test&quot; \<br>
&gt;         op start interval=&quot;0&quot; timeout=&quot;120&quot; \<br>
&gt;         op stop interval=&quot;0&quot; timeout=&quot;120&quot; \<br>
&gt;         op monitor interval=&quot;30s&quot; timeout=&quot;30s&quot; OCF_CHECK_LEVEL=&quot;1&quot;<br>
&gt; target-role=&quot;Started&quot;<br>
&gt; group mysql datavg fs_mysql Cluster-VIP mysqld cluster_status_page<br>
&gt; mail_alert<br>
&gt; ms ms_drbd_mysql drbd_mysql \<br>
&gt;         meta master-max=&quot;1&quot; master-node-max=&quot;1&quot; clone-max=&quot;2&quot;<br>
&gt; clone-node-max=&quot;1&quot; notify=&quot;true&quot;<br>
&gt; location mysql-preferred-node mysql inf: node1<br>
&gt; colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master<br>
&gt; order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start<br>
&gt; property $id=&quot;cib-bootstrap-options&quot; \<br>
&gt;         dc-version=&quot;1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558&quot; \<br>
&gt;         cluster-infrastructure=&quot;openais&quot; \<br>
&gt;         expected-quorum-votes=&quot;2&quot; \<br>
&gt;         stonith-enabled=&quot;false&quot; \<br>
&gt;         no-quorum-policy=&quot;ignore&quot; \<br>
&gt;         last-lrm-refresh=&quot;1340701656&quot;<br>
&gt; rsc_defaults $id=&quot;rsc-options&quot; \<br>
&gt;         resource-stickiness=&quot;100&quot; \<br>
&gt;         migration-threshold=&quot;2&quot; \<br>
&gt;         failure-timeout=&quot;30s&quot;<br>
&gt;<br>
&gt;<br>
</div></div><div class="HOEnZb"><div class="h5">&gt; _______________________________________________<br>
&gt; Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
&gt; <a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
&gt;<br>
&gt; Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
&gt; Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
&gt; Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
&gt;<br>
<br>
<br>
<br>
<br>
</div></div><br>_______________________________________________<br>
Pacemaker mailing list: <a href="mailto:Pacemaker@oss.clusterlabs.org">Pacemaker@oss.clusterlabs.org</a><br>
<a href="http://oss.clusterlabs.org/mailman/listinfo/pacemaker" target="_blank">http://oss.clusterlabs.org/mailman/listinfo/pacemaker</a><br>
<br>
Project Home: <a href="http://www.clusterlabs.org" target="_blank">http://www.clusterlabs.org</a><br>
Getting started: <a href="http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf" target="_blank">http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf</a><br>
Bugs: <a href="http://bugs.clusterlabs.org" target="_blank">http://bugs.clusterlabs.org</a><br>
<br></blockquote></div><br>