[ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups

Mon Jul 25 11:36:28 EDT 2016

On 25/07/16 16:27, Klaus Wenninger wrote:
> On 07/25/2016 04:56 PM, Thomas Lamprecht wrote:
>> Thanks for the fast reply :)
>>
>>
>> On 07/25/2016 03:51 PM, Christine Caulfield wrote:
>>> On 25/07/16 14:29, Thomas Lamprecht wrote:
>>>> Hi all,
>>>>
>>>> I'm currently testing the new features of corosync 2.4, especially
>>>> qdevices.
>>>> First tests show quite nice results, like having quorum on a single
>>>> node left out of a three node cluster.
>>>>
>>>> But what I'm a bit worried about is what happens if the server where
>>>> qnetd runs, or the qdevice daemon, fails.
>>>> In this case the cluster cannot afford any other loss of a node in my
>>>> three node setup, as votes expected are
>>>> 5 and thus 3 are needed for quorum, which I cannot fulfill if the qnetd
>>>> does not run or has failed.
>>> We're looking into ways of making this more resilient. It might be
>>> possible to cluster a qnetd (though this is not currently supported) in
>>> a separate cluster from the arbitrated one, obviously.
>>
>> Yeah I saw that in the QDevice document, that would be a way.
>>
>> I guess the qnetd daemons would then act like a cluster of their own, as
>> there would be a need to communicate which node sees which qnetd daemon,
>> so that a decision about the quorate partition can be made?
>>
>> But that always binds the reliability of a cluster to that of a single
>> node, adding a dependency.
>> That means failures of components outside of the cluster,
>> which would otherwise have
>> no effect on the cluster behaviour, may now affect it, which could be a
>> problem?
>>
>> I know that's a worst case scenario, but with only one qnetd running on a
>> single (external) node
>> it can happen. If the reliability of the node running qnetd is the
>> same as that of each cluster node,
>> the reliability of the whole cluster in a three node case would be,
>> quite simplified and if I remember my introduction course to this topic
>> somewhat correctly:
>>
>> Without qnetd: 1 - ( (1 - R1) *  (1 - R2) * (1 - R3))
>>
>> With qnetd: (1 - ( (1 - R1) *  (1 - R2) * (1 - R3)) ) * Rqnetd
>>
>> Where R1, R2, R3 are the reliabilities of the cluster nodes and
>> Rqnetd is the reliability of the node running qnetd.
>> While that's a really, really simplified model that does not quite
>> correctly depict reality, the base concept holds that the reliability
>> of the whole cluster becomes dependent on that of the node running
>> qnetd, doesn't it?
>>
> With lms and ffsplit I guess the calculation is not that simple anymore ...
> 
> Correct me if I'm wrong, but I think the bottom line for understanding the
> benefits of qdevice is to think of classic quorum generation as
> basically taking a snapshot of the situation at a certain time and deriving
> the reactions from that, whereas qdevice tries to benefit from
> knowledge of the past (that is, how we got into the current
> situation).
>  

Actually no. qdevice is totally stateless (which is why I think it will
cluster well when we get round to it). It makes the best decision it can
based on the fact that it should have a full view of all nodes in the
cluster regardless of whether they can see each other - and if they
can't see qdevice then they don't get the vote anyway.

Chrissie
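
For readers following the numbers in the quoted messages, the vote
arithmetic and the simplified reliability model can be sketched in Python.
This is only an illustration under stated assumptions: the function names
are hypothetical, the 5 expected votes are taken to be one vote per node
plus two for the qdevice, quorum is a strict majority of expected votes,
and the reliability formulas are exactly the simplified ones quoted above.
As noted in the thread, they probably do not capture how the lms and
ffsplit algorithms really behave.

```python
# Illustrative sketch only; all names here are hypothetical helpers,
# not part of corosync or any of its tools.

def quorum_threshold(expected_votes: int) -> int:
    """Strict majority of the expected votes."""
    return expected_votes // 2 + 1


def has_quorum(votes_present: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the majority threshold."""
    return votes_present >= quorum_threshold(expected_votes)


def reliability_without_qnetd(r1: float, r2: float, r3: float) -> float:
    """Simplified model from the thread: 1 - (1-R1)(1-R2)(1-R3)."""
    return 1 - (1 - r1) * (1 - r2) * (1 - r3)


def reliability_with_qnetd(r1: float, r2: float, r3: float,
                           r_qnetd: float) -> float:
    """Simplified model from the thread: the value above times Rqnetd."""
    return reliability_without_qnetd(r1, r2, r3) * r_qnetd


# Assumption: three nodes with one vote each plus two qdevice votes.
EXPECTED_VOTES = 5

print("votes needed for quorum:", quorum_threshold(EXPECTED_VOTES))
# qnetd unreachable and one node lost: only two node votes remain.
print("quorate with 2 votes:", has_quorum(2, EXPECTED_VOTES))
# qnetd unreachable but all three nodes up.
print("quorate with 3 votes:", has_quorum(3, EXPECTED_VOTES))

# Example reliabilities of 0.99 are made up for illustration.
print("model without qnetd:", reliability_without_qnetd(0.99, 0.99, 0.99))
print("model with qnetd:", reliability_with_qnetd(0.99, 0.99, 0.99, 0.99))
```

In this model the qnetd host enters as a plain multiplicative factor, which
is the dependency the original question is concerned about.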