[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Klaus Wenninger
kwenning at redhat.com
Thu Oct 6 14:26:25 UTC 2016
On 10/06/2016 04:16 PM, Digimer wrote:
> On 06/10/16 05:38 AM, Martin Schlegel wrote:
>> Thanks for the confirmation Jan, but this sounds a bit scary to me!
>>
>> Spinning this experiment a bit further ...
>>
>> Would this not also mean that with a passive RRP with 2 rings, it only takes 2
>> different nodes that are unable to communicate on different networks at the
>> same time to have all rings marked faulty on _every_ node ... therefore all
>> cluster members losing quorum immediately, even though n-2 cluster members are
>> technically able to send and receive heartbeat messages through both rings?
>>
>> I really hope the answer is no and the cluster somehow still retains quorum in
>> this case.
>>
>> Regards,
>> Martin Schlegel
>>
>>
>>> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
>>>
>>> Martin,
>>>
>>>> Hello all,
>>>>
>>>> I am trying to understand the following 2 Corosync heartbeat ring failure
>>>> scenarios I have been testing, and I hope somebody can explain why the
>>>> results make sense.
>>>>
>>>> Consider the following cluster:
>>>>
>>>> * 3x Nodes: A, B and C
>>>> * 2x NICs for each Node
>>>> * Corosync 2.3.5 configured with "rrp_mode: passive" and
>>>> udpu transport with ring id 0 and 1 on each node.
>>>> * On each node "corosync-cfgtool -s" shows:
>>>> [...] ring 0 active with no faults
>>>> [...] ring 1 active with no faults
>>>>
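
For reference, a totem/nodelist section of corosync.conf for a setup like the
one quoted above might look roughly as follows (all addresses, node ids and
ports below are placeholders, not values from the original report):

    totem {
        version: 2
        transport: udpu
        rrp_mode: passive

        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0    # placeholder network for ring 0
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.1.0    # placeholder network for ring 1
            mcastport: 5405
        }
    }

    nodelist {
        node {
            nodeid: 1
            ring0_addr: 192.168.0.1     # node A on ring 0 (placeholder)
            ring1_addr: 192.168.1.1     # node A on ring 1 (placeholder)
        }
        # nodes B and C are defined the same way with their own addresses
    }

    quorum {
        provider: corosync_votequorum
    }
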
>>>> Consider the following scenarios:
>>>>
>>>> 1. On node A only, block all communication on the first NIC, configured with
>>>> ring id 0
>>>> 2. On node A only, block all communication on both NICs, configured with
>>>> ring id 0 and 1
>>>>
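
As an aside, one common way to simulate such a block for testing (my
assumption, the report does not say how it was actually done) is to drop all
traffic on the affected NIC of node A with iptables, e.g.:

    # block everything on the ring-0 interface (interface name is a placeholder)
    iptables -A INPUT  -i eth0 -j DROP
    iptables -A OUTPUT -o eth0 -j DROP

    # remove the block again after the test
    iptables -D INPUT  -i eth0 -j DROP
    iptables -D OUTPUT -o eth0 -j DROP
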
>>>> The result of the above scenarios is as follows:
>>>>
>>>> 1. Nodes A, B and C (!) display the following ring status:
>>>> [...] Marking ringid 0 interface <IP-Address> FAULTY
>>>> [...] ring 1 active with no faults
>>>> 2. Node A is shown as OFFLINE - B and C display the following ring status:
>>>> [...] ring 0 active with no faults
>>>> [...] ring 1 active with no faults
>>>>
>>>> Questions:
>>>> 1. Is this the expected outcome?
>>> Yes
>>>
>>>> 2. In experiment 1, B and C can still communicate with each other over both
>>>> NICs, so why are B and C not displaying a "no faults" status for ring id 0
>>>> and 1, just like in experiment 2,
>>> Because this is how RRP works. RRP marks the whole ring as failed, so every
>>> node sees that ring as failed.
>>>
>>>> when node A is completely unreachable?
>>> Because it's a different scenario. In scenario 1 there is a 3-node
>>> membership in which one node has a failed ring -> the whole ring is
>>> marked failed. In scenario 2 there is a 2-node membership in which both
>>> rings work as expected. Node A is completely unreachable and is not in
>>> the membership.
>>>
>>> Regards,
>>> Honza
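
Worth adding here (not part of Honza's answer): once the network problem
behind the faulty ring has been repaired, the FAULTY state can be cleared
cluster wide with corosync-cfgtool; depending on the corosync version this
may also happen automatically.

    # show the ring status on a node
    corosync-cfgtool -s

    # re-enable a ring marked FAULTY after the network has been repaired
    corosync-cfgtool -r
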
> Have you considered using active/passive bonded interfaces? If you did,
> you would be able to fail links in any order on the nodes and corosync
> would not know the difference.
>
Still an interesting point I hadn't been aware of so far - although
I knew the bits, I probably hadn't thought about them enough until
now...
Usually one - at least me, so far - would rather think that having
the awareness of redundancy/the cluster as high up as possible in the
protocol/application stack would open up possibilities for more
appropriate reactions.
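
For anyone who wants to try the bonding route, a rough sketch of an
active-backup bond with iproute2 could look like the following (interface
names and the address are placeholders; a distribution's own network
configuration tooling is usually the better place to set this up
persistently):

    # create a bond in active-backup (failover) mode
    ip link add bond0 type bond mode active-backup miimon 100

    # enslave both NICs (links must be down while being added)
    ip link set eth0 down
    ip link set eth0 master bond0
    ip link set eth1 down
    ip link set eth1 master bond0

    # bring the bond up and give it the address corosync binds to
    ip link set bond0 up
    ip addr add 192.168.0.1/24 dev bond0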