[ClusterLabs] Antw: Re: reproducible split brain

Mon Mar 21 09:27:52 CET 2016

On 19/03/16 15:43, Digimer wrote:
> On 19/03/16 10:10 AM, Dennis Jacobfeuerborn wrote:
>> On 18.03.2016 00:50, Digimer wrote:
>>> On 17/03/16 07:30 PM, Christopher Harvey wrote:
>>>> On Thu, Mar 17, 2016, at 06:24 PM, Ken Gaillot wrote:
>>>>> On 03/17/2016 05:10 PM, Christopher Harvey wrote:
>>>>>> If I ignore pacemaker's existence, and just run corosync, corosync
>>>>>> disagrees about node membership in the situation presented in the first
>>>>>> email. While it's true that stonith just happens to quickly correct the
>>>>>> situation after it occurs it still smells like a bug in the case where
>>>>>> corosync is used in isolation. Corosync is after all a membership and
>>>>>> total ordering protocol, and the nodes in the cluster are unable to
>>>>>> agree on membership.
>>>>>>
>>>>>> The Totem protocol specifies a ring_id in the token passed in a ring.
>>>>>> Since all of the 3 nodes but one have formed a new ring with a new id
>>>>>> how is it that the single node can survive in a ring with no other
>>>>>> members passing a token with the old ring_id?
>>>>>>
>>>>>> Are there network failure situations that can fool the Totem membership
>>>>>> protocol or is this an implementation problem? I don't see how it could
>>>>>> not be one or the other, and it's bad either way.
>>>>>
>>>>> Neither, really. In a split brain situation, there simply is not enough
>>>>> information for any protocol or implementation to reliably decide what
>>>>> to do. That's what fencing is meant to solve -- it provides the
>>>>> information that certain nodes are definitely not active.
>>>>>
>>>>> There's no way for either side of the split to know whether the opposite
>>>>> side is down, or merely unable to communicate properly. If the latter,
>>>>> it's possible that they are still accessing shared resources, which
>>>>> without proper communication, can lead to serious problems (e.g. data
>>>>> corruption of a shared volume).
>>>>
>>>> The totem protocol is silent on the topic of fencing and resources, much
>>>> the way TCP is.
>>>>
>>>> Please explain to me what needs to be fenced in a cluster without
>>>> resources where membership and total message ordering are the only
>>>> concern. If fencing were a requirement for membership and ordering,
>>>> wouldn't stonith be part of corosync and not pacemaker?
>>>
>>> Corosync is a membership and communication layer (and in v2+, a quorum
>>> provider). It doesn't care about or manage anything higher up. So it
>>> doesn't care about fencing itself.
>>>
>>> It simply cares about:
>>>
>>> * Who is in the cluster?
>>> * How do the members communicate?
>>> * (v2+) Are there enough members for quorum?
>>> * Notify resource managers of membership changes (join or loss).
>>>
>>> The resource manager, pacemaker or rgmanager, cares about resources, so
>>> it is what cares about making smart decisions. As Ken pointed out,
>>> without fencing, it can never tell the difference between no access and
>>> dead peer.
>>>
>>> This is (again) why fencing is critical.
>>
>> I think the key issue here is that when people think about corosync they
>> believe there can only be two states for membership (true or false) when
>> in reality there are three possible states: true, false and unknown.
>>

Not really. As far as corosync is concerned 'unknown' and 'down' are the
same thing - it can't differentiate between them. That's the reason that
quorum exists. It says that if there are enough nodes in the 'up' state
then we can proceed, if not we can't because we don't have enough
knowledge about the cluster node set. Fencing is then the layer above
that to make sure that the minority of nodes are returned to a known
state so that they can be returned cleanly to the cluster and contribute
back to quorum.
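
As a rough illustration of the quorum rule described above, here is a
minimal sketch. It is not corosync's votequorum code: the function name
has_quorum and the one-vote-per-node simplification are assumptions made
for the example.

```python
def has_quorum(nodes_up: int, expected_nodes: int) -> bool:
    """Return True if a strict majority of the expected nodes is visible.

    Hypothetical helper: assumes every node carries exactly one vote.
    A node that is 'unknown' is counted the same as 'down', i.e. it is
    simply not part of nodes_up.
    """
    if expected_nodes <= 0:
        raise ValueError("expected_nodes must be positive")
    if not 0 <= nodes_up <= expected_nodes:
        raise ValueError("nodes_up must be between 0 and expected_nodes")
    # Strict majority: more than half of the expected votes.
    return nodes_up > expected_nodes // 2


# A 3-node cluster splitting 2/1: only the two-node side may proceed.
majority_side = has_quorum(nodes_up=2, expected_nodes=3)
minority_side = has_quorum(nodes_up=1, expected_nodes=3)
```

With an even split (for example 2 of 4) neither side has a strict
majority under this rule, so neither side would proceed.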

>> The problem then is that corosync apparently has no built-in way to deal
>> with the "unknown" situation and requires guidance from an external
>> entity for that (in this case pacemaker's fencing).
>>
>> This means that corosync alone simply cannot give you reliable
>> membership guarantees. It strictly requires external help to be able to
>> provide that.
>>
>> Regards,
>>   Dennis
> 
> I'm not sure that is accurate.
> 
> If corosync declares a node lost (failed to receive X tokens in Y time),
> the node is declared lost and it reforms a new cluster, without the lost
> member. So from corosync's perspective, the lost node is no longer a
> member (it won't receive messages). It is possible that the lost node
> might itself be alive, in which case its corosync will do the same
> thing (reform a new cluster, possibly with itself as the sole member).
> 
> If you're trying to have corosync *do* something, then that is missing
> the point of corosync, I think. In all cases I've ever seen, you need a
> separate resource manager to actually react to the membership changes.

Pretty much correct. corosync provides higher layers with a (sub)set of
nodes that can or cannot be allowed to run. It's up to those higher
layers to make the decision whether running services is a sensible
decision given the nodes available and the likely state of the others.
Pacemaker can differentiate between "unknown" and "down" because it
initiates and checks the success (or otherwise) of the fencing operation.
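
To make the "unknown" versus "down" distinction concrete, here is a
minimal sketch of the idea. It is not pacemaker's API: NodeState,
resolve_lost_node and the fence callback are hypothetical names, with
the callback standing in for whatever fence agent a real cluster uses.

```python
from enum import Enum
from typing import Callable


class NodeState(Enum):
    UP = "up"
    DOWN = "down"
    UNKNOWN = "unknown"


def resolve_lost_node(node: str, fence: Callable[[str], bool]) -> NodeState:
    """Classify a node that has dropped out of the membership.

    The membership layer alone can only report that the node is gone.
    The state becomes DOWN only once the (hypothetical) fence callback
    reports that fencing succeeded; otherwise it stays UNKNOWN.
    """
    return NodeState.DOWN if fence(node) else NodeState.UNKNOWN


# Fencing confirmed: the node is known to be down.
confirmed = resolve_lost_node("node3", lambda name: True)
# Fencing failed or was not possible: the state is still unknown.
unconfirmed = resolve_lost_node("node3", lambda name: False)
```

Only in the first case would it be safe for a higher layer to recover
the resources the lost node was running.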

Yes, you can run a cluster without fencing/pacemaker etc. But to do that
you would need to either make sure that you didn't keep *any* state on
the nodes that could get corrupted should a node go rogue, or
re-implement the higher layers of pacemaker et al, so that corruption
cannot occur.

Chrissie