[ClusterLabs] Antw: Re: Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Thu Oct 6 10:05:24 UTC 2016
>>> Martin Schlegel <martin at nuboreto.org> wrote on 06.10.2016 at 11:38 in message
<1736253685.165937.28a72b84-a091-48c4-83f9-74a8bbde1a18.open-xchange at email.1und1.de>:
> Thanks for the confirmation, Jan, but this sounds a bit scary to me!
>
> Spinning this experiment a bit further ...
>
> Would this not also mean that with a passive rrp with 2 rings it only takes 2
> different nodes that are not able to communicate on different networks at the
> same time to have all rings marked faulty on _every_ node ... therefore all
> cluster members losing quorum immediately, even though n-2 cluster members are
> technically able to send and receive heartbeat messages through both rings?
>
> I really hope the answer is no and the cluster still somehow has a quorum in
> this case.
>
> Regards,
> Martin Schlegel
>
>
>> Jan Friesse <jfriesse at redhat.com> wrote on 5 October 2016 at 09:01:
>>
>> Martin,
>>
>> > Hello all,
>> >
>> > I am trying to understand the following 2 Corosync heartbeat ring failure
>> > scenarios I have been testing, and hope somebody can explain why this
>> > makes any sense.
>> >
>> > Consider the following cluster:
>> >
>> > * 3x Nodes: A, B and C
>> > * 2x NICs for each Node
>> > * Corosync 2.3.5 configured with "rrp_mode: passive" and
>> > udpu transport with ring id 0 and 1 on each node.
>> > * On each node "corosync-cfgtool -s" shows:
>> > [...] ring 0 active with no faults
>> > [...] ring 1 active with no faults
>> >
>> > Consider the following scenarios:
>> >
>> > 1. On node A only block all communication on the first NIC configured with
>> > ring id 0
>> > 2. On node A only block all communication on all NICs configured with
>> > ring id 0 and 1
>> >
>> > The result of the above scenarios is as follows:
>> >
>> > 1. Nodes A, B and C (!) display the following ring status:
>> > [...] Marking ringid 0 interface <IP-Address> FAULTY
>> > [...] ring 1 active with no faults
>> > 2. Node A is shown as OFFLINE - B and C display the following ring status:
>> > [...] ring 0 active with no faults
>> > [...] ring 1 active with no faults
>> >
>> > Questions:
>> > 1. Is this the expected outcome ?
>>
>> Yes
>>
>> > 2. In experiment 1, B and C can still communicate with each other over
>> > both NICs, so why are B and C not displaying a "no faults" status for
>> > ring id 0 and 1, just like in experiment 2.
>>
>> Because this is how RRP works. RRP marks the whole ring as failed, so every
>> node sees that ring as failed.
>>
>> > when node A is completely unreachable ?
>>
>> Because it's a different scenario. In scenario 1 there is a 3-node
>> membership where one of the nodes has one failed ring -> the whole ring is
>> failed. In scenario 2 there is a 2-node membership where both rings
>> work as expected. Node A is completely unreachable and it's not in the
>> membership.
Did you ever wonder why it's named "ring"? ;-)
Our rings used to fail periodically under network load.
Regards,
Ulrich
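
The difference between the two scenarios, as Honza explains it in the quoted text, can be sketched as a small toy model. This is only an illustration of the described behaviour, not corosync code: the function `ring_status`, its arguments and the rule "a node blocked on every ring leaves the membership" are assumptions made up for this sketch.

```python
# Toy model of the behaviour described in the thread; NOT corosync code.
# Assumption: a ring is reported as faulty by every member as soon as any
# node in the current membership cannot use that ring.

RINGS = (0, 1)


def ring_status(nodes, blocked):
    """Return (membership, status) for a hypothetical cluster.

    nodes:   iterable of node names, e.g. ["A", "B", "C"]
    blocked: dict mapping a node name to the set of ring ids it cannot use
    """
    # Assumption: a node that is cut off on all rings drops out of the
    # membership entirely (it is shown as OFFLINE by the others).
    membership = sorted(
        n for n in nodes if blocked.get(n, set()) != set(RINGS)
    )

    status = {}
    for ring in RINGS:
        # One member with a broken interface marks the ring for everyone.
        broken = any(ring in blocked.get(n, set()) for n in membership)
        status[ring] = "FAULTY" if broken else "no faults"
    return membership, status


if __name__ == "__main__":
    # Scenario 1: node A loses ring 0 only.
    print(ring_status(["A", "B", "C"], {"A": {0}}))
    # Scenario 2: node A loses ring 0 and ring 1.
    print(ring_status(["A", "B", "C"], {"A": {0, 1}}))
```

Under these assumptions, scenario 1 keeps all three nodes in the membership with ring 0 marked FAULTY everywhere, while scenario 2 leaves a membership of B and C with both rings reporting no faults, which matches the output quoted above.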
>>
>> Regards,
>> Honza
>>
>> > Regards,
>> > Martin Schlegel
>> >
>> > _______________________________________________
>> > Users mailing list: Users at clusterlabs.org
>> > http://clusterlabs.org/mailman/listinfo/users
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>>
>> >
>