[Pacemaker] split brain - after network recovery - resources can still be migrated

Digimer lists at alteeve.ca
Sun Oct 26 13:06:04 EDT 2014


On 26/10/14 12:32 PM, Andrei Borzenkov wrote:
> On Sun, 26 Oct 2014 12:01:03 +0100
> Vladimir <ml at foomx.de> wrote:
>
>> On Sat, 25 Oct 2014 19:11:02 -0400
>> Digimer <lists at alteeve.ca> wrote:
>>
>>> On 25/10/14 06:35 PM, Vladimir wrote:
>>>> On Sat, 25 Oct 2014 17:30:07 -0400
>>>> Digimer <lists at alteeve.ca> wrote:
>>>>
>>>>> On 25/10/14 05:09 PM, Vladimir wrote:
>>>>>> Hi,
>>>>>>
>>>>>> currently I'm testing a 2 node setup using ubuntu trusty.
>>>>>>
>>>>>> # The scenario:
>>>>>>
>>>>>> All communication links between the 2 nodes are cut off. This
>>>>>> results in a split brain situation and both nodes bring their
>>>>>> resources online.
>>>>>>
>>>>>> When the communication links get back, I see following behaviour:
>>>>>>
>>>>>> On drbd level the split brain is detected and the device is
>>>>>> disconnected on both nodes because of "after-sb-2pri disconnect"
>>>>>> and then it goes to StandAlone ConnectionState.
>>>>>>
>>>>>> I'm wondering why pacemaker does not let the resources fail.
>>>>>> It is still possible to migrate resources between the nodes
>>>>>> although they're in StandAlone ConnectionState. After a split
>>>>>> brain that's not what I want.
>>>>>>
>>>>>> Is this the expected behaviour? Is it possible to let the
>>>>>> resources fail after the network recovery, to avoid further data
>>>>>> corruption?
>>>>>>
>>>>>> (At the moment I can't use resource or node level fencing in my
>>>>>> setup.)
>>>>>>
>>>>>> Here the main part of my config:
>>>>>>
>>>>>> #> dpkg -l | awk '$2 ~ /^(pacem|coro|drbd|libqb)/{print $2,$3}'
>>>>>> corosync 2.3.3-1ubuntu1
>>>>>> drbd8-utils 2:8.4.4-1ubuntu1
>>>>>> libqb-dev 0.16.0.real-1ubuntu3
>>>>>> libqb0 0.16.0.real-1ubuntu3
>>>>>> pacemaker 1.1.10+git20130802-1ubuntu2.1
>>>>>> pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.1
>>>>>>
>>>>>> # pacemaker
>>>>>> primitive drbd-mysql ocf:linbit:drbd \
>>>>>> params drbd_resource="mysql" \
>>>>>> op monitor interval="29s" role="Master" \
>>>>>> op monitor interval="30s" role="Slave"
>>>>>>
>>>>>> ms ms-drbd-mysql drbd-mysql \
>>>>>> meta master-max="1" master-node-max="1" clone-max="2" \
>>>>>> clone-node-max="1" notify="true"
>>>>>
>>>>> Split-brains are prevented by using reliable fencing (aka stonith).
>>>>> You configure stonith in pacemaker (using IPMI/iRMC/iLO/etc,
>>>>> switched PDUs, etc). Then you configure DRBD to use the
>>>>> crm-fence-peer.sh fence-handler and you set the fencing policy to
>>>>> 'resource-and-stonith;'.
>>>>>
>>>>> This way, if all links fail, both nodes block and call a fence. The
>>>>> faster one fences (powers off) the slower, and then it begins
>>>>> recovery, assured that the peer is not doing the same.
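
For reference, the DRBD side of that is just the fencing policy plus the
fence handlers; a minimal sketch for DRBD 8.4 (handler paths are where
drbd8-utils normally installs them, adjust as needed):

    resource mysql {
      disk {
        fencing resource-and-stonith;
      }
      handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
      # existing device/disk/net/syncer options unchanged
    }

With that in place, crm-fence-peer.sh adds a constraint that blocks
promotion on the outdated peer, and crm-unfence-peer.sh removes it once
the resync has finished.
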
>>>>>
>>>>> Without stonith/fencing, there is no defined behaviour. You
>>>>> will get split-brains and that is that. Consider: both nodes lose
>>>>> contact with their peer. Without fencing, both must assume the peer
>>>>> is dead and thus take over resources.
>>>>
>>>> That split brains can occur in such a setup is clear. But I
>>>> would expect pacemaker to stop the drbd resource when the link
>>>> between the cluster nodes is re-established instead of continuing
>>>> to run it.
>>>
>>> DRBD will refuse to reconnect until it is told which node's data to
>>> delete. This is data loss and can not be safely automated.
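
(For completeness: once an administrator has decided which data set to
throw away, the manual recovery is the usual drbdadm sequence; "mysql"
is just the resource name from the config above.

    # on the node whose changes are to be discarded
    drbdadm disconnect mysql
    drbdadm secondary mysql
    drbdadm connect --discard-my-data mysql

    # on the surviving node, if it is also StandAlone
    drbdadm connect mysql

After the resync, pacemaker can manage the resource normally again.)
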
>>
>> Sorry if I described it unclearly, but I don't want pacemaker to do an
>> automatic split brain recovery. That would not make any sense to me
>> either. This decision has to be taken by an administrator.
>>
>> But is it possible to configure pacemaker to do the following?
>>
>> - if there are 2 nodes which can see and communicate with each other
>>    AND
>> - if their disk state is not UpToDate/UpToDate (typically after a split
>>    brain)
>> - then let the drbd resource fail, because something is obviously broken
>>    and an administrator has to decide how to continue.
>>
>
> This would require resource agent support. But it looks like the current
> resource agent relies on fencing to resolve split brain situations. As
> long as the resource agent itself does not indicate a resource failure,
> there is nothing pacemaker can do.
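
(Outside the resource agent, the condition is at least easy to detect by
hand; a rough illustration only, not a supported mechanism, using the
resource and node names from this thread:

    #!/bin/sh
    # flag the DRBD clone instance on this node as failed if the
    # connection or disk state looks like an unresolved split brain
    RES=mysql
    CSTATE=$(drbdadm cstate $RES)
    DSTATE=$(drbdadm dstate $RES)
    if [ "$CSTATE" = "StandAlone" ] || [ "$DSTATE" != "UpToDate/UpToDate" ]; then
        # for clones/ms resources the instance may need a :0/:1 suffix
        crm_resource --fail --resource drbd-mysql --node "$(uname -n)"
    fi

Something like this could run from cron or a custom monitor, but as noted
above, doing it properly would mean teaching the ocf:linbit:drbd agent to
report the failure itself.)
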
>
>
>>>>> This is why stonith is required in clusters. Even with quorum, you
>>>>> can't assume anything about the state of the peer until it is
>>>>> fenced, so it would only give you a false sense of security.
>>>>
>>>> Maybe I can use resource level fencing.
>>>
>>> You need node-level fencing.
>>
>> I know node level fencing is more secure. But shouldn't resource level
>> fencing also work here? e.g.
>> (http://www.drbd.org/users-guide/s-pacemaker-fencing.html)
>>
>> Currently I can't use ipmi, apc switches or a shared storage device
>> for fencing, at most fencing via ssh. But from what I've read, that is
>> also not recommended for production setups.
>>
>
> You could try the meatware stonith agent. It does exactly what you want -
> it freezes further processing until an administrator manually declares
> one node as down.
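
(For the record, that would look something like this in crm syntax, with
the hostlist parameter from the cluster-glue meatware plugin; see the
caveat below before using it:

    primitive fence-manual stonith:meatware \
            params hostlist="node1 node2" \
            op monitor interval="3600s"

The cluster then blocks on the fence until an administrator, having
physically verified the peer is down, confirms it with
'meatclient -c <nodename>'.)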

Support for this was dropped in the Red Hat world back in 2010ish 
because it was so easy for a panicky admin to clear the fence without 
adequately ensuring the peer node had actually been turned off. I 
*strongly* recommend against using manual fencing. If your nodes have 
IPMI, use that. It's super well tested.
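
For comparison, an IPMI-based setup in crm syntax is only a little more
work (external/ipmi is the cluster-glue plugin; addresses and credentials
here are placeholders):

    primitive fence-node1 stonith:external/ipmi \
            params hostname="node1" ipaddr="10.0.0.1" userid="admin" passwd="secret" \
            op monitor interval="60s"
    primitive fence-node2 stonith:external/ipmi \
            params hostname="node2" ipaddr="10.0.0.2" userid="admin" passwd="secret" \
            op monitor interval="60s"
    location l-fence-node1 fence-node1 -inf: node1
    location l-fence-node2 fence-node2 -inf: node2
    property stonith-enabled="true"

The location rules just keep each fence device off the node it is meant
to kill.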

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



