[ClusterLabs] Failure of preferred node in a 2 node cluster

Mon Apr 30 02:51:26 EDT 2018

On 29/04/18 13:22, Andrei Borzenkov wrote:
> 29.04.2018 04:19, Wei Shan wrote:
>> Hi,
>>
>> I'm using Red Hat Cluster Suite 7 with a watchdog-timer-based fence
>> agent. I understand this is a really bad setup, but this is what the
>> end user wants.
>>
>> ATB => auto_tie_breaker
>>
>> "When the auto_tie_breaker is used in even-number member clusters, then the
>> failure of the partition containing the auto_tie_breaker_node (by default
>> the node with lowest ID) will cause other partition to become inquorate and
>> it will self-fence. In 2-node clusters with auto_tie_breaker this means
>> that failure of node favoured by auto_tie_breaker_node (typically nodeid 1)
>> will result in reboot of other node (typically nodeid 2) that detects the
>> inquorate state. If this is undesirable then corosync-qdevice can be used
>> instead of the auto_tie_breaker to provide additional vote to quorum making
>> behaviour closer to odd-number member clusters."
>>
> 
> That's not what the upstream corosync manual pages say. Corosync itself
> won't initiate self-fencing, it just marks node as being out of quorum.
> What happens later depends on higher layers like pacemaker. Pacemaker
> can be configured to commit suicide, but can also be configured to
> ignore quorum completely. I am not familiar with the details of how RHCS
> behaves by default.
> 
> I just tested on vanilla corosync+pacemaker (openSUSE Tumbleweed) and
> nothing happens when I kill the lowest node in a two-node configuration.
> 

That is the expected behaviour for a 2 node ATB cluster. If the
preferred node is not available then the remaining node will stall until
it comes back again. It sounds odd, but that's what happens. A preferred
node is a preferred node. If it can move from one to the other when it
fails then it's not a preferred node ... it's just a node :)
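
To make the rule concrete, here is a minimal sketch of the quorum
decision in Python. It is only a model of the behaviour described
above, not corosync code: the function name is_quorate and its
parameters are made up for illustration, and it assumes one vote per
node with the lowest node ID as the default preferred node.

```python
def is_quorate(partition, all_nodes, auto_tie_breaker=True, tie_breaker_node=None):
    """Model whether `partition` keeps quorum (assumes one vote per node)."""
    expected_votes = len(all_nodes)
    votes = len(partition)

    # A strict majority is always quorate.
    if 2 * votes > expected_votes:
        return True

    # Exact 50/50 split: only the side holding the preferred node wins.
    if auto_tie_breaker and 2 * votes == expected_votes:
        preferred = min(all_nodes) if tie_breaker_node is None else tie_breaker_node
        return preferred in partition

    return False


nodes = {1, 2}
print(is_quorate({1}, nodes))  # node 2 failed: True, node 1 carries on
print(is_quorate({2}, nodes))  # node 1 failed: False, node 2 stalls
```

The second call is the case from the original question: the survivor
is not the preferred node, so it never regains quorum on its own.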

If you need fully resilient failover for 2 nodes then qdevice is more
likely what you need.
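
As a rough illustration of why the extra vote helps, the sketch below
counts votes for a cluster with a quorum device. It is a simplified
model, not the corosync-qdevice implementation: the function name and
parameters are hypothetical, and it assumes the device contributes
exactly one vote and grants it to the surviving partition.

```python
def quorate_with_qdevice(surviving_nodes, total_nodes, qdevice_vote_granted):
    """Model quorum with one extra vote from a quorum device (assumption)."""
    expected_votes = total_nodes + 1          # nodes plus the device's vote
    quorum = expected_votes // 2 + 1          # strict majority
    votes = surviving_nodes + (1 if qdevice_vote_granted else 0)
    return votes >= quorum


# Two-node cluster, one node down, device votes for the survivor.
print(quorate_with_qdevice(1, 2, True))   # True: either node can carry on
# Same failure, but the survivor cannot reach the device.
print(quorate_with_qdevice(1, 2, False))  # False
```

With three expected votes the cluster behaves like an odd-sized one,
so it no longer matters which of the two nodes is the one that failed.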

Chrissie

> If your cluster nodes are configured to commit suicide, what happens
> after reboot depends on at least the wait_for_all corosync setting. With
> wait_for_all=1 (default in two_node), and without a) ignoring quorum
> state and b) having a fencing resource, pacemaker on your node will wait
> indefinitely after reboot because its partner is not available.