[Pacemaker] MySQL, Percona replication manager - split brain

Sun Oct 26 08:51:13 UTC 2014

26.10.2014 08:32, Andrei Borzenkov wrote:
> On Sat, 25 Oct 2014 23:34:54 +0300
> Andrew <nitr0 at seti.kr.ua> wrote:
>
>> 25.10.2014 22:34, Digimer wrote:
>>> On 25/10/14 03:32 PM, Andrew wrote:
>>>> Hi all.
>>>>
>>>> I use the Percona RA on a cluster (nothing mission-critical currently -
>>>> just zabbix data); today, after restarting the MySQL resource (crm resource
>>>> restart p_mysql), I got a split-brain state - MySQL for some reason
>>>> started first on the ex-slave node, and the ex-master started later (possibly
>>>> I set too small a shutdown timeout - only 120s, but I'm not sure).
>>>>
>>>> After restarting the resource on both nodes, MySQL replication seemed to be
>>>> OK - but then after ~50 min it failed into split brain again for an unknown
>>>> reason (no resource restart was noticed).
>>>>
>>>> In 'show replication status' there is an error on a table caused by a
>>>> duplicate entry in a unique index.
>>>>
>>>> So I have some questions:
>>>> 1) What causes the split brain, and how can I avoid it in the future?
>>> Cause:
>>>
>>> Logs?
>> Oct 25 13:54:13 node2 crmd[29248]:   notice: do_state_transition: State
>> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
>> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_config: On loss
>> of CCM Quorum: Ignore
>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
>> monitor found resource p_pgsql:0 active in master mode on node1.cluster
>> Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation
>> monitor found resource p_mysql:1 active in master mode on node2.cluster
> That seems too late. The real cause is that the resource was reported as
> being in master state on both nodes, and this happened earlier.
These are different resources (pgsql and mysql).
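
To make that concrete, here is a minimal sketch (plain Python, not part of Pacemaker; the function names are made up for illustration) that groups the "active in master mode" messages by resource. It assumes the wrapped log lines have already been re-joined and that the text after the colon in names like p_mysql:1 is only a clone instance number.

```python
import re
from collections import defaultdict

# Matches the pengine unpack_rsc_op message once wrapped lines are re-joined.
MASTER_RE = re.compile(
    r"found resource (?P<instance>\S+) active in master mode on (?P<node>\S+)"
)


def masters_by_resource(log_lines):
    """Map each base resource name to the set of nodes reporting it as master."""
    masters = defaultdict(set)
    for line in log_lines:
        match = MASTER_RE.search(line)
        if match is None:
            continue
        # Assumption: drop the clone instance suffix, "p_mysql:1" -> "p_mysql".
        resource = match.group("instance").split(":", 1)[0]
        masters[resource].add(match.group("node"))
    return dict(masters)


def dual_masters(log_lines):
    """Return only the resources reported as master on more than one node."""
    return {
        resource: nodes
        for resource, nodes in masters_by_resource(log_lines).items()
        if len(nodes) > 1
    }


# The two pengine lines quoted above, with the mail line wrapping undone.
LOG = [
    "Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation "
    "monitor found resource p_pgsql:0 active in master mode on node1.cluster",
    "Oct 25 13:54:13 node2 pengine[29247]:   notice: unpack_rsc_op: Operation "
    "monitor found resource p_mysql:1 active in master mode on node2.cluster",
]

if __name__ == "__main__":
    print(masters_by_resource(LOG))
    print(dual_masters(LOG))
```

On the two lines quoted above this reports one master node for p_pgsql and one for p_mysql, and no resource with two masters. It says nothing about what happened earlier in the log.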

>>> Prevent:
>>>
>>> Fencing (aka stonith). This is why fencing is required.
>> There was no node failure. Just the daemon was restarted.
>>
> "Split brain" == loss of communication. It does not matter whether
> communication was lost because the node failed or because the daemon was
> not running. There is no way for the surviving node to know *why*
> communication was lost.
>
So how will stonith help in this case? The daemon will be restarted after
its death if that occurs during the restart, and stonith will see a live
daemon...

So what is the easiest way to recover from the split brain? Just stop the
daemons and copy all MySQL data from the good node to the bad one?