[ClusterLabs] Antw: Growing a cluster from 1 node without fencing

Edwin Török edvin.torok at citrix.com
Tue Sep 5 11:27:59 EDT 2017


[Sorry for the long delay in replying, I was on vacation]

On 14/08/17 15:30, Klaus Wenninger wrote:
> If you have a disk you could use as shared-disk for sbd you could
> achieve a quorum-disk-like-behavior. (your package-versions
> look as if you are using RHEL-7.4)

Thanks for the suggestion, I've tried that and it solves the problem I
was having!

Tested with just one shared disk for now, and it doesn't fence when
growing the cluster from 1 node to 2.
In hindsight that makes sense: the 2nd node isn't running yet, so there
is no one to tell node 1 to fence. This appears to be a lot more
reliable than the behaviour without a shared disk.
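
For anyone else trying this, a rough sketch of the kind of setup
involved (the device path /dev/sdb is just a placeholder for the
shared LUN):

    # write the sbd metadata/slot area to the shared disk
    sbd -d /dev/sdb create
    # verify the header and the on-disk timeouts
    sbd -d /dev/sdb dump

    # in /etc/sysconfig/sbd:
    SBD_DEVICE="/dev/sdb"
    SBD_WATCHDOG_TIMEOUT=5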

> From sbd's point of view this is the expected behavior.
> sbd handles ignore, stop & freeze exactly the same, by categorizing
> the problem as something transient that might be overcome within
> the watchdog-timeout.
> In the case of suicide it would self-fence immediately.
> Of course one might argue whether it would make sense not to handle
> all 3 configurations the same in sbd - but that is how it is
> implemented at the moment.
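
For context: ignore, stop, freeze and suicide here are the possible
values of Pacemaker's no-quorum-policy cluster property, e.g.:

    # what the cluster does with resources when quorum is lost
    pcs property set no-quorum-policy=freeze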

Would you consider accepting patches to make this configurable for the
1-node use-case without a shared disk?

>>> Without the cluster property stonith-watchdog-timeout set to a
>>> value matching (twice is a good choice) the watchdog-timeout
>>> configured in /etc/sysconfig/sbd (default = 5s), a node will never
>>> assume the unseen partner is fenced.
>>> Anyway, watchdog-only sbd is of very limited use in 2-node
>>> scenarios. It kind of limits the availability to that of the node
>>> that would win the tie-breaker game. But it might still be useful
>>> in certain scenarios of course (like load-sharing ...).
>>
>> Good point.
>
> Still the question remains why you didn't set stonith-watchdog-timeout ...

I was trying to get a smaller test case and was removing commands that
didn't affect the outcome. I've added it back now.
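
With the default 5s watchdog timeout in /etc/sysconfig/sbd that comes
down to something like:

    # roughly twice SBD_WATCHDOG_TIMEOUT (default 5s)
    pcs property set stonith-watchdog-timeout=10s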

>> Aug 14 08:57:30 cluster1 corosync[2208]:  [CFG   ] Config reload
>> requested by node 1
>> Aug 14 08:57:30 cluster1 corosync[2208]:  [TOTEM ] adding new UDPU
>> member {10.71.77.147}
>> Aug 14 08:57:30 cluster1 corosync[2208]:  [QUORUM] This node is within
>> the non-primary component and will NOT provide any services.
>> Aug 14 08:57:30 cluster1 corosync[2208]:  [QUORUM] Members[1]: 1
>> Aug 14 08:57:30 cluster1 crmd[2221]:  warning: Quorum lost
>> Aug 14 08:57:30 cluster1 pacemakerd[2215]:  warning: Quorum lost
>>
>> ^^^^^^^^^ Looks unexpected
>
> I'm not so familiar with how corosync handles dynamic config changes.
> Maybe you are on the losing side of the tie-breaker, or wait-for-all
> is kicking in if it is configured.

Wait-for-all was not enabled, and IIUC the tie-breaker should by
default make the node with the smallest node id survive, which in this
case was the initial node.
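
For the record, the relevant votequorum flags can be checked roughly
like this:

    # active quorum flags (2Node, WaitForAll, AutoTieBreaker, ...)
    corosync-quorumtool -s
    # configured quorum keys in the cmap database
    corosync-cmapctl | grep '^quorum\.'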

> It would be interesting to see how the 2-node setting would handle that.
> But the 2-node setting would of course break quorum-based fencing.

pcs wouldn't let me use SBD with 2-node mode; it insists on
auto_tie_breaker.
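
For reference, the two alternatives correspond roughly to these quorum
sections in corosync.conf (note that two_node: 1 implicitly turns on
wait_for_all):

    quorum {
        provider: corosync_votequorum
        two_node: 1                     # implies wait_for_all: 1
    }

    quorum {
        provider: corosync_votequorum
        auto_tie_breaker: 1
        auto_tie_breaker_node: lowest   # default: lowest node id wins
    }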


Best regards,
--Edwin



