[Pacemaker] Multiple split-brain problem

coma coma.inf at gmail.com
Wed Jun 27 06:04:09 EDT 2012


Thanks for your reply, Andreas.

My first node (the active one) is a virtual machine, the second (passive) node
is a standalone physical server. There is no high load on either of them, but the
problem seems to come from the virtual server.
I do get the same split-brain when I take or delete a virtual machine
snapshot (the network connection is lost for a short moment, maybe about
1s). But I only take a snapshot once a week, and I see split-brain several
times a week.
I haven't detected any other loss of connectivity, or perhaps there are micro
network cuts that my monitoring system does not catch (and I have no problems
with my non-clustered services).
If these are micro-cuts, I think the problem is on the DRBD side: is it too
sensitive? Can I adjust some values to avoid the problem?
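
If it is just a question of timeouts, I guess it would be something like this
in the DRBD net section? The values below are only a guess on my part, to make
the link survive a ~1s blip; I haven't tested them:

    net {
        ping-int      10;  # seconds of idle time before a keep-alive ping (the default)
        ping-timeout  30;  # tenths of a second to wait for a ping answer (default 5 = 0.5s)
        timeout       90;  # tenths of a second before the peer is declared dead (default 60 = 6s);
                           # must stay below ping-int and connect-int
        connect-int   10;  # seconds between reconnection attempts (the default)
    }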


I will try increasing my token value to 10000 and consensus to 12000, and I will
configure resource-level fencing in DRBD. Thanks for the tips.
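
If I understand the tips correctly, that would give something like this
(syntax taken from the docs as far as I could tell, so please correct me if I
got it wrong). In corosync.conf:

    totem {
        version: 2
        token: 10000          # wait 10s before declaring a token loss
        consensus: 12000      # must be at least 1.2 x token
        ...
    }

And in the DRBD resource, resource-level fencing on top of the
crm-fence-peer.sh / crm-unfence-peer.sh handlers I already have:

    resource drbd-mysql {
        disk {
            on-io-error detach;
            fencing resource-only;
        }
        ...
    }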

About redundant rings: I read in the DRBD documentation that they are vital
for resource-level fencing, but can I do without them?
Because I use a virtual server (my virtual servers run on a blade) I can't
have a "physical" link between the two nodes (a direct cable between them), so I
use "virtual links" (VLANs to separate them from my main network). I could
create a second corosync link, but I doubt its usefulness: if something
goes wrong with the first link, I think I would have the same problem on
the second. Although they are virtually separated, they use the same
physical hardware (all my hardware is redundant, so link problems should be
very limited).
But maybe I'm wrong, I'll think about it (a sketch of what I have in mind is below).
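
If I do give the second ring a try, I suppose it would look something like
this in corosync.conf, reusing the DRBD VLAN as the backup path (the multicast
addresses and ports here are just placeholders):

    totem {
        ...
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.3.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.2.0   # the DRBD replication VLAN as a second ring
            mcastaddr: 226.94.1.2
            mcastport: 5407
        }
    }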


About STONITH, I will read the documentation, but is it really worth rolling
out the big guns for a simple 2-node cluster in active/passive mode?
(I read that STONITH is mostly used for active/active clusters.)
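
If I do go that way, is something like this the right direction? For the
physical node an IPMI-based agent seems the obvious choice, if I read the
external/ipmi parameters correctly (the address and credentials below are
placeholders); for the virtual node I guess I would need a hypervisor-level
agent instead (external/vcenter or fence_vmware_soap, depending on what the
blade runs):

    primitive st-node2 stonith:external/ipmi \
            params hostname="node2" ipaddr="10.1.0.32" userid="stonith" passwd="secret" interface="lan" \
            op monitor interval="60m" timeout="60s"
    location st-node2-not-on-node2 st-node2 -inf: node2
    property stonith-enabled="true"

(The location constraint is there so a node never runs its own fencing device.)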

Anyway, thank you for the advice, it is much appreciated!


2012/6/26 Andreas Kurz <andreas at hastexo.com>

> On 06/26/2012 03:49 PM, coma wrote:
> > Hello,
> >
> > I'm running a 2-node cluster with corosync & DRBD in active/passive
> > mode for MySQL high availability.
> >
> > The cluster is working fine (failover/failback & replication OK), and I have
> > no network outage (the network is monitored and I haven't seen any failure),
> > but split-brain occurs very often and I don't understand why. Maybe you can
> > help me?
>
> Are the nodes virtual machines or have a high load from time to time?
>
> >
> > I'm a new pacemaker/corosync/DRBD user, so my cluster and DRBD
> > configuration are probably not optimal; if you have any comments,
> > tips or examples I would be very grateful!
> >
> > Here is an example of the corosync log when a split-brain occurs (1 hour of
> > log to show before/after the split-brain):
> >
> > http://pastebin.com/3DprkcTA
>
> Increase your token value in corosync.conf to a higher value ... like
> 10s, configure resource-level fencing in DRBD, set up STONITH for your
> cluster, and use redundant corosync rings.
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > Thank you in advance for any help!
> >
> >
> > More details about my configuration:
> >
> > I have:
> > One preferred "master" node (node1) on a virtual server, and one "slave"
> > node on a physical server.
> > On each server:
> > eth0 is connected to my main LAN for client/server communication (with the
> > cluster VIP)
> > eth1 is connected to a dedicated VLAN for corosync communication
> > (network: 192.168.3.0/30)
> > eth2 is connected to a dedicated VLAN for DRBD replication (network:
> > 192.168.2.0/30)
> >
> > Here is my drbd configuration:
> >
> >
> > resource drbd-mysql {
> > protocol C;
> >     disk {
> >         on-io-error detach;
> >     }
> >     handlers {
> >         fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> >         after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> >         split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> >     }
> >     net {
> >         cram-hmac-alg sha1;
> >         shared-secret "secret";
> >         after-sb-0pri discard-younger-primary;
> >         after-sb-1pri discard-secondary;
> >         after-sb-2pri call-pri-lost-after-sb;
> >     }
> >     startup {
> >         wfc-timeout  1;
> >         degr-wfc-timeout 1;
> >     }
> >     on node1{
> >         device /dev/drbd1;
> >         address 192.168.2.1:7801;
> >         disk /dev/sdb;
> >         meta-disk internal;
> >     }
> >     on node2 {
> >     device /dev/drbd1;
> >     address 192.168.2.2:7801;
> >     disk /dev/sdb;
> >     meta-disk internal;
> >     }
> > }
> >
> >
> > Here is my cluster config:
> >
> > node node1 \
> >         attributes standby="off"
> > node node2 \
> >         attributes standby="off"
> > primitive Cluster-VIP ocf:heartbeat:IPaddr2 \
> >         params ip="10.1.0.130" broadcast="10.1.7.255" nic="eth0"
> > cidr_netmask="21" iflabel="VIP1" \
> >         op monitor interval="10s" timeout="20s" \
> >         meta is-managed="true"
> > primitive cluster_status_page ocf:heartbeat:ClusterMon \
> >         params pidfile="/var/run/crm_mon.pid"
> > htmlfile="/var/www/html/cluster_status.html" \
> >         op monitor interval="4s" timeout="20s"
> > primitive datavg ocf:heartbeat:LVM \
> >         params volgrpname="datavg" exclusive="true" \
> >         op start interval="0" timeout="30" \
> >         op stop interval="0" timeout="30"
> > primitive drbd_mysql ocf:linbit:drbd \
> >         params drbd_resource="drbd-mysql" \
> >         op monitor interval="15s"
> > primitive fs_mysql ocf:heartbeat:Filesystem \
> >         params device="/dev/datavg/data" directory="/data" fstype="ext4"
> > primitive mail_alert ocf:heartbeat:MailTo \
> >         params email="myemail at test.com" \
> >         op monitor interval="10" timeout="10" depth="0"
> > primitive mysqld ocf:heartbeat:mysql \
> >         params binary="/usr/bin/mysqld_safe" config="/etc/my.cnf"
> > datadir="/data/mysql/databases" user="mysql"
> > pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysql.sock"
> > test_passwd="cluster_test" test_table="Cluster_Test.dbcheck"
> > test_user="cluster_test" \
> >         op start interval="0" timeout="120" \
> >         op stop interval="0" timeout="120" \
> >         op monitor interval="30s" timeout="30s" OCF_CHECK_LEVEL="1"
> > target-role="Started"
> > group mysql datavg fs_mysql Cluster-VIP mysqld cluster_status_page
> > mail_alert
> > ms ms_drbd_mysql drbd_mysql \
> >         meta master-max="1" master-node-max="1" clone-max="2"
> > clone-node-max="1" notify="true"
> > location mysql-preferred-node mysql inf: node1
> > colocation mysql_on_drbd inf: mysql ms_drbd_mysql:Master
> > order mysql_after_drbd inf: ms_drbd_mysql:promote mysql:start
> > property $id="cib-bootstrap-options" \
> >
> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
> >         cluster-infrastructure="openais" \
> >         expected-quorum-votes="2" \
> >         stonith-enabled="false" \
> >         no-quorum-policy="ignore" \
> >         last-lrm-refresh="1340701656"
> > rsc_defaults $id="rsc-options" \
> >         resource-stickiness="100" \
> >         migration-threshold="2" \
> >         failure-timeout="30s"
> >
> >