[ClusterLabs] Reliability questions on the new QDevices in uneven node count setups

Thomas Lamprecht t.lamprecht at proxmox.com
Mon Jul 25 10:56:40 EDT 2016


Thanks for the fast reply :)


On 07/25/2016 03:51 PM, Christine Caulfield wrote:
> On 25/07/16 14:29, Thomas Lamprecht wrote:
>> Hi all,
>>
>> I'm currently testing the new features of corosync 2.4, especially
>> qdevices.
>> First tests show quite nice results, like retaining quorum on the
>> single remaining node of a three node cluster.
>>
>> But what I'm a bit worried about is what happens if the server where
>> qnetd runs fails, or the qdevice daemon fails:
>> in this case the cluster cannot afford any further loss of a node in
>> my three node setup, as expected votes are
>> 5 and thus 3 are needed for quorum, which I cannot fulfill if qnetd
>> does not run or has failed.
> We're looking into ways of making this more resilient. It might be
> possible to cluster a qnetd (though this is not currently supported) in
> a separate cluster from the arbitrated one, obviously.

Yeah I saw that in the QDevice document, that would be a way.
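
For reference, the quorum section of the corosync.conf I'm testing with
looks roughly like this (just a sketch from memory; the qnetd host name
is a placeholder and details may differ):

    quorum {
      provider: corosync_votequorum
      device {
        model: net
        # with the lms algorithm the device provides node count - 1
        # votes, here 2, so expected votes become 5 for three nodes
        votes: 2
        net {
          host: qnetd.example.com
          algorithm: lms
        }
      }
    }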

The qnetd daemons would then act as a cluster of their own, I guess, as
they would need to communicate which node sees which qnetd daemon, so
that a decision about the quorate partition can be made?

But that still ties the reliability of the cluster to that of a single
node, adding a dependency: failures of components outside the cluster,
which would otherwise have no effect on the cluster's behaviour, may
now affect it. Couldn't that be a problem?

I know that's a worst-case scenario, but with only one qnetd running on
a single (external) node it can happen. And if the reliability of the
node running qnetd is the same as that of each cluster node, the
reliability of the whole cluster in the three node case would be, quite
simplified (if I remember my introductory course on this topic
correctly):

Without qnetd: 1 - ((1 - R1) * (1 - R2) * (1 - R3))

With qnetd:    (1 - ((1 - R1) * (1 - R2) * (1 - R3))) * Rqnetd

Where R1, R2, R3 are the reliabilities of the cluster nodes and Rqnetd
is the reliability of the node running qnetd.
While that's a really, really simplified model which does not quite
correctly depict reality, the basic concept stands: the reliability of
the whole cluster becomes dependent on that of the node running qnetd,
or am I wrong?
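
To make that concrete, plugging in R = 0.99 for every machine involved:

Without qnetd: 1 - (0.01 * 0.01 * 0.01) = 0.999999

With qnetd:    0.999999 * 0.99 = 0.98999901

So in this (again, very simplified) model the reliability of the qnetd
node dominates the whole term.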

>
> The LMS algorithm is quite smart about how it doles out its vote and can
> handle isolation from the main qnetd provided that the main core of the
> cluster (the majority in a split) retains quorum, but any more serious
> changes to the cluster config will cause it to be withdrawn. So in this
> case you should find that your 3 node cluster will continue to work in
> the absence of the qnetd server or link, provided you don't lose any nodes.

Yes, I read that in the documentation and also saw it during testing,
really good work!

The point of my mail was exactly the failure of qnetd itself and the
resulting situation that the cluster then cannot afford to lose any
node, while without qnetd it could afford to lose (n - 1) / 2 nodes.
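
Spelled out for the three node example:

qnetd up:    3 node votes + 2 qdevice votes = 5 expected; quorum = 3,
             so even a single surviving node (1 + 2 = 3) stays quorate.
qnetd down:  at most 3 of the expected 5 votes remain; quorum = 3,
             so not a single node may fail.
no qdevice:  3 votes expected; quorum = 2,
             so 1 node, i.e. (3 - 1) / 2, may fail.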

Or do I also have to enable quorum.last_man_standing together with
quorum.wait_for_all to allow scaling down the expected votes if qnetd
fails completely?
I will test that.
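
Concretely, something like this is what I'd try (again only a sketch;
10000 ms is, as far as I know, the default window from votequorum(5)):

    quorum {
      provider: corosync_votequorum
      wait_for_all: 1
      last_man_standing: 1
      # window after which expected votes may be recalculated downwards
      last_man_standing_window: 10000
      # plus the device section as sketched above
    }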

I just want to be sure that my thoughts are correct, or at least not
completely flawed: qnetd as it is makes sense in an even node count
cluster with the ffsplit algorithm, but not in an uneven node count
cluster, unless the reliability of the node running qnetd can be
guaranteed, e.g. by adding HA to the service (VM or container) running
qnetd.

best regards,
Thomas

>
> In a 3 node setup obviously LMS is more appropriate than ffsplit anyway.
>
> Chrissie
>
>> So in this case I'm bound to the reliability of the server providing
>> the qnetd service:
>> if it fails I cannot afford to lose any other node in my three node
>> example,
>> or in any other example with an uneven node count, as the qdevice
>> vote subsystem provides node count - 1 votes.
>>
>> So if I see it correctly, QDevices only make sense in the case of
>> even node counts,
>> maybe especially 2 node setups: if qnetd works we have one more node
>> which may fail, and if qnetd fails
>> we are as good as without it, as qnetd provides only one vote here.
>>
>> Am I missing something, or any thoughts on that?