[ClusterLabs] Redundant ring not recovering after node is back

Jan Friesse jfriesse at redhat.com
Thu Aug 23 06:40:44 UTC 2018


David,

> Hello,
> I'm going crazy over this problem, which I hope to resolve here with
> your help, guys:
> 
> I have 2 nodes with Corosync redundant ring feature.
> 
> Each node has 2 similarly connected/configured NICs. The nodes are
> connected to each other by two crossover cables.

I believe this is the root of the problem. Are you using NetworkManager?
If so, have you installed NetworkManager-config-server? If not, please
install it and test again.
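
For example, on a RHEL/CentOS-style system that would be (just a sketch;
the package name and the exact path differ between distributions):

  # install the "server" NetworkManager profile
  yum install NetworkManager-config-server

  # it ships a drop-in along these lines (00-server.conf), which tells
  # NetworkManager not to reconfigure interfaces when carrier disappears:
  [main]
  no-auto-default=*
  ignore-carrier=*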

> 
> I configured both nodes with rrp mode passive. Everything is working well
> at this point, but when I shut down one node to test failover and that node
> comes back online, corosync marks the interface as FAULTY and rrp

I believe it's because, with the crossover-cable setup, when the other
side is shut down NetworkManager detects the loss of carrier and does an
ifdown of the interface, and corosync is unable to handle ifdown properly.
Ifdown is bad with a single ring, but it's a real killer with RRP
(127.0.0.1 poisons every node in the cluster).
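
If that package doesn't exist on your distribution, a roughly equivalent
manual workaround (untested sketch; the file name is made up and the
interface names are only a guess, adjust them to your system) is a
drop-in under /etc/NetworkManager/conf.d/ telling NetworkManager to
ignore carrier loss on the heartbeat NICs:

  # e.g. /etc/NetworkManager/conf.d/99-heartbeat-rings.conf (hypothetical)
  [main]
  # keep the crossover/heartbeat NICs configured even when the peer is
  # powered off and carrier drops
  ignore-carrier=interface-name:eth1,interface-name:eth2

followed by a restart (or configuration reload) of NetworkManager.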

> fails to recover the initial state:
> 
> 1. Initial scenario:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>          id      = 192.168.0.1
>          status  = ring 0 active with no faults
> RING ID 1
>          id      = 192.168.1.1
>          status  = ring 1 active with no faults
> 
> 
> 2. When I shut down node 2, everything continues with no faults. Sometimes
> the ring IDs bind to 127.0.0.1 and then bind back to their respective
> heartbeat IPs.

Again, result of ifdown.

> 
> 3. When node 2 is back online:
> 
> # corosync-cfgtool -s
> Printing ring status.
> Local node ID 1
> RING ID 0
>          id      = 192.168.0.1
>          status  = ring 0 active with no faults
> RING ID 1
>          id      = 192.168.1.1
>          status  = Marking ringid 1 interface 192.168.1.1 FAULTY
> 
> 
> # service corosync status
> ● corosync.service - Corosync Cluster Engine
>     Loaded: loaded (/lib/systemd/system/corosync.service; enabled; vendor
> preset: enabled)
>     Active: active (running) since Wed 2018-08-22 14:44:09 CEST; 1min 38s ago
>       Docs: man:corosync
>             man:corosync.conf
>             man:corosync_overview
>   Main PID: 1439 (corosync)
>      Tasks: 2 (limit: 4915)
>     CGroup: /system.slice/corosync.service
>             └─1439 /usr/sbin/corosync -f
> 
> 
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.0.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]: Aug 22 14:44:11 notice  [TOTEM ] The
> network interface [192.168.1.1] is now up.
> Aug 22 14:44:11 node1 corosync[1439]:   [TOTEM ] The network interface
> [192.168.1.1] is now up.
> Aug 22 14:44:26 node1 corosync[1439]: Aug 22 14:44:26 notice  [TOTEM ] A
> new membership (192.168.0.1:601760) was formed. Members
> Aug 22 14:44:26 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601760) was formed. Members
> Aug 22 14:44:32 node1 corosync[1439]: Aug 22 14:44:32 notice  [TOTEM ] A
> new membership (192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:32 node1 corosync[1439]:   [TOTEM ] A new membership (
> 192.168.0.1:601764) was formed. Members joined: 2
> Aug 22 14:44:34 node1 corosync[1439]: Aug 22 14:44:34 error   [TOTEM ]
> Marking ringid 1 interface 192.168.1.1 FAULTY
> Aug 22 14:44:34 node1 corosync[1439]:   [TOTEM ] Marking ringid 1 interface
> 192.168.1.1 FAULTY
> 
> 
> If I execute corosync-cfgtool, it clears the faulty error, but after a few
> seconds the ring becomes FAULTY again.
> The only thing that resolves the problem is to restart the service with
> service corosync restart.
> 
> Here are some of my configuration settings on node 1 (I have already tried
> changing rrp_mode):
> 
> *- corosync.conf*
> 
> totem {
>          version: 2
>          cluster_name: node
>          token: 5000
>          token_retransmits_before_loss_const: 10
>          secauth: off
>          threads: 0
>          rrp_mode: passive
>          nodeid: 1
>          interface {
>                  ringnumber: 0
>                  bindnetaddr: 192.168.0.0
>                  #mcastaddr: 226.94.1.1
>                  mcastport: 5405
>                  broadcast: yes
>          }
>          interface {
>                  ringnumber: 1
>                  bindnetaddr: 192.168.1.0
>                  #mcastaddr: 226.94.1.2
>                  mcastport: 5407
>                  broadcast: yes
>          }
> }
> 
> logging {
>          fileline: off
>          to_stderr: yes
>          to_syslog: yes
>          to_logfile: yes
>          logfile: /var/log/corosync/corosync.log
>          debug: off
>          timestamp: on
>          logger_subsys {
>                  subsys: AMF
>                  debug: off
>          }
> }
> 
> amf {
>          mode: disabled
> }
> 
> quorum {
>          provider: corosync_votequorum
>          expected_votes: 2
> }
> 
> nodelist {
>          node {
>                  nodeid: 1
>                  ring0_addr: 192.168.0.1
>                  ring1_addr: 192.168.1.1
>          }
> 
>          node {
>                  nodeid: 2
>                  ring0_addr: 192.168.0.2
>                  ring1_addr: 192.168.1.2
>          }
> }
> 
> aisexec {
>          user: root
>          group: root
> }
> 
> service {
>          name: pacemaker
>          ver: 1
> }
> 
> 
> 
> *- /etc/hosts*
> 
> 
> 127.0.0.1       localhost
> 10.4.172.5      node1.upc.edu node1
> 10.4.172.6      node2.upc.edu node2
> 

So the machines have 3 NICs? Two for corosync/cluster traffic and one for 
regular traffic/services/outside world?

> 
> Thank you in advance for your help!

To conclude:
- If you are using NetworkManager, try installing
NetworkManager-config-server; it will probably help.
- If you are brave enough, try corosync 3.x (the current Alpha4 is pretty 
stable - actually, some other projects only gain this stability with SP1 :) ).
It has no RRP, but instead uses knet to support redundant links (up to 8 
links can be configured) and doesn't have problems with ifdown. A rough 
configuration sketch follows below.
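
Purely as an illustration (untested sketch, adjust addresses and options
to your environment), the two links from your configuration would look
roughly like this with knet; quorum, logging and the rest stay much the
same:

totem {
        version: 2
        cluster_name: node
        transport: knet
        # roughly the knet counterpart of rrp_mode: passive
        link_mode: passive
        crypto_cipher: none
        crypto_hash: none
}

nodelist {
        node {
                nodeid: 1
                ring0_addr: 192.168.0.1
                ring1_addr: 192.168.1.1
        }
        node {
                nodeid: 2
                ring0_addr: 192.168.0.2
                ring1_addr: 192.168.1.2
        }
        # additional links: just add ring2_addr .. ring7_addr per node
}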

Honza



