[Pacemaker] MySQL, Percona replication manager - split brain
Andrei Borzenkov
arvidjaar at gmail.com
Sun Oct 26 06:32:29 UTC 2014
On Sat, 25 Oct 2014 23:34:54 +0300
Andrew <nitr0 at seti.kr.ua> wrote:
> On 25.10.2014 22:34, Digimer wrote:
> > On 25/10/14 03:32 PM, Andrew wrote:
> >> Hi all.
> >>
> >> I use the Percona replication manager RA on a cluster (nothing
> >> mission-critical currently - just Zabbix data); today, after restarting the
> >> MySQL resource (crm resource restart p_mysql), I got a split-brain state -
> >> MySQL for some reason started first on the ex-slave node, and the ex-master
> >> started later (possibly I set too small a shutdown timeout - only 120s, but
> >> I'm not sure).
> >>
> >> After restarting the resource on both nodes, MySQL replication seemed to
> >> be OK - but then, after ~50 min, it fell into split brain again for an
> >> unknown reason (no resource restart was noticed).
> >>
> >> In the 'SHOW SLAVE STATUS' output there is a table error caused by a
> >> unique index duplicate.
> >>
> >> So I have some questions:
> >> 1) What caused the split brain, and how can I avoid it in the future?
> >
> > Cause:
> >
> > Logs?
> Oct 25 13:54:13 node2 crmd[29248]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_config: On loss of CCM Quorum: Ignore
> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation monitor found resource p_pgsql:0 active in master mode on node1.cluster
> Oct 25 13:54:13 node2 pengine[29247]: notice: unpack_rsc_op: Operation monitor found resource p_mysql:1 active in master mode on node2.cluster
That seems too late. The real cause is that the resource was reported as
being in the master state on both nodes, and that must have happened earlier
in the logs.
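To confirm, look for the first promote/master reports on each node. A minimal
sketch (the log path is distribution-specific and the patterns are
assumptions, adjust to your setup):

    # when was each node first reported/promoted as master?
    grep -E 'promote|active in master mode' /var/log/syslog | head
    # one-shot view of current resource state and node attributes
    crm_mon -1 -A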
>
> >
> > Prevent:
> >
> > Fencing (aka stonith). This is why fencing is required.
> No node failure - just the daemon was restarted.
>
"Split brain" == loss of communication. It does not matter whether
communication was lost because node failed or because daemon was not
running. There is no way for surviving node to know, *why*
communication was lost.
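For example, fencing could be wired up roughly like this; a sketch only - the
stonith plugin, addresses, and credentials below are placeholders, not taken
from this thread:

    # one fencing device per node; keep each device off the node it fences
    crm configure primitive fence-node1 stonith:external/ipmi \
        params hostname=node1.cluster ipaddr=192.0.2.1 userid=admin passwd=secret interface=lan
    crm configure primitive fence-node2 stonith:external/ipmi \
        params hostname=node2.cluster ipaddr=192.0.2.2 userid=admin passwd=secret interface=lan
    crm configure location l-fence-node1 fence-node1 -inf: node1.cluster
    crm configure location l-fence-node2 fence-node2 -inf: node2.cluster
    # enable fencing cluster-wide
    crm configure property stonith-enabled=true

With that in place, a node that stops responding is powered off before the
survivor promotes itself, so two masters cannot coexist.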
> >
> >> 2) How do I resolve the split-brain state? Is it enough just to wait for
> >> the failure, then restart MySQL by hand, clean the row with the duplicate
> >> index in the slave DB, and then run the resource again? Or is there some
> >> automation for such cases?
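Generally there is no safe automation for this, because the cluster cannot
know which copy of the data is authoritative. A common manual sequence looks
roughly like this (a sketch; first verify which side holds the data you want
to keep):

    -- on the failing slave, in the mysql shell:
    STOP SLAVE;
    -- inspect Last_SQL_Error and the conflicting key
    SHOW SLAVE STATUS\G
    -- either delete/fix the conflicting row in the slave DB, or skip
    -- the single failed replication event:
    SET GLOBAL sql_slave_skip_counter = 1;
    START SLAVE;

Afterwards, clear the resource's fail count so Pacemaker retries it:
crm resource cleanup p_mysql.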
> >
> > How are you sharing data? Can you give us a better understanding of
> > your setup?
> >
> Semi-synchronous MySQL replication, if you mean sharing the DB log between
> nodes.
>
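For reference, semi-synchronous replication in stock MySQL (5.5+) is enabled
along these lines; a sketch with example values, not the poster's actual
config:

    # my.cnf on both nodes, since roles can switch under Pacemaker
    plugin-load = "rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so"
    rpl_semi_sync_master_enabled = 1
    rpl_semi_sync_slave_enabled = 1
    rpl_semi_sync_master_timeout = 1000   # ms; master falls back to async after this

Note that after that timeout the master silently degrades to asynchronous
replication, so semi-sync alone does not guarantee the slave is current at
failover time.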
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org