[Pacemaker] Corosync fails to start when NIC is absent

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Tue Jan 13 14:35:05 CET 2015


Honza,

Thank you for helping me.
So, there is no defined behavior in case one of the interfaces is not in
the system?
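
Since the rebind described below can apparently be avoided by making sure every configured ring address is present before Corosync starts, a pre-start check along these lines might help. This is only a minimal sketch, not something Corosync or Pacemaker ships: the function names are hypothetical, and it assumes an address counts as "present" when a UDP socket can be bound to it locally. That assumption does not hold on Linux hosts with the net.ipv4.ip_nonlocal_bind sysctl enabled.

```python
import errno
import socket


def address_is_local(addr: str) -> bool:
    """Return True if a UDP socket can be bound to addr on this host.

    Assumption: a failed bind with EADDRNOTAVAIL means the address is
    not assigned to any local interface.
    """
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            # Port 0 lets the kernel pick any free port; only the
            # address part matters for this check.
            sock.bind((addr, 0))
        except OSError as exc:
            if exc.errno == errno.EADDRNOTAVAIL:
                return False
            # Anything else (malformed address, permissions) is a
            # different problem, so do not hide it.
            raise
    return True


def missing_ring_addresses(ring_addrs):
    """Return the ring addresses that are not assigned on this host."""
    return [addr for addr in ring_addrs if not address_is_local(addr)]
```

An init script wrapper could call missing_ring_addresses() with this node's ring0_addr and ring1_addr values (for example the 169.254.1.3 address seen in the logs below) and refuse to start Corosync if the returned list is not empty.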


Thank you,
Kostya

On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse at redhat.com> wrote:

> Kostiantyn,
>
>
> > According to https://access.redhat.com/solutions/638843 , an interface
> > that is defined in corosync.conf must be present in the system (see the
> > "ROOT CAUSE" section at the bottom of the article).
> > To confirm that, I ran a couple of tests.
> >
> > Here is the relevant part of the corosync.conf file (in free form; the
> > original config file is also attached):
> > ===============================
> > rrp_mode: passive
> > ring0_addr is defined in corosync.conf
> > ring1_addr is defined in corosync.conf
> > ===============================
> >
> > -------------------------------
> >
> > Two-node cluster
> >
> > -------------------------------
> >
> > Test #1:
> > --------------------------------------------------
> > IP for ring0 is not defined in the system:
> > --------------------------------------------------
> > Start Corosync simultaneously on both nodes.
> > Corosync fails to start.
> > From the logs:
> > Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
> > config: No interfaces defined
> > Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
> > Engine exiting with status 8 at main.c:1343.
> > Result: Corosync and Pacemaker are not running.
> >
> > Test #2:
> > --------------------------------------------------
> > IP for ring1 is not defined in the system:
> > --------------------------------------------------
> > Start Corosync simultaneously on both nodes.
> > Corosync starts.
> > Start Pacemaker simultaneously on both nodes.
> > Pacemaker fails to start.
> > From the logs, the last entries written by corosync:
> > Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0
> > interface 169.254.1.3 FAULTY
> > Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically
> > recovered ring 0
> > Result: Corosync and Pacemaker are not running.
> >
> >
> > Test #3:
> >
> > "rrp_mode: active" leads to the same result, except Corosync and Pacemaker
> > init scripts return status "running".
> > But still "vim /var/log/cluster/corosync.log" shows a lot of errors like:
> > Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
> > to the CPG API failed: Library error (2)
> >
> > Result: Corosync and Pacemaker show their statuses as "running", but
> > "crm_mon" cannot connect to the cluster database. And half of the
> > Pacemaker's services are not running (including Cluster Information Base
> > (CIB)).
> >
> >
> > -------------------------------
> >
> > For a single node mode
> >
> > -------------------------------
> >
> > IP for ring0 is not defined in the system:
> >
> > Corosync fails to start.
> >
> > IP for ring1 is not defined in the system:
> >
> > Corosync and Pacemaker are started.
> >
> > It is possible that configuration will be applied successfully (50%),
> >
> > and it is possible that the cluster is not running any resources,
> >
> > and it is possible that the node cannot be put in a standby mode (shows:
> > communication error),
> >
> > and it is possible that the cluster is running all resources, but applied
> > configuration is not guaranteed to be fully loaded (some rules can be
> > missed).
> >
> >
> > -------------------------------
> >
> > Conclusions:
> >
> > -------------------------------
> >
> > It is possible that in some rare cases (see comments to the bug) the
> > cluster will work, but in that case its working state is unstable and the
> > cluster can stop working at any moment.
> >
> >
> > So, is it correct? Do my assumptions make any sense? I didn't find any
> > other explanation on the Internet.
>
> Corosync needs all interfaces during start and runtime. This doesn't
> mean they must be connected (this would make corosync unusable for
> physical NIC/Switch or cable failure), but they must be up and have
> the correct IP.
>
> When this is not the case, corosync rebinds to localhost and weird
> things happen. Removing this rebinding is a long-standing TODO, but there
> are still more important bugs (especially because the rebind can be avoided).
>
> Regards,
>   Honza
>
> >
> >
> >
> > Thank you,
> > Kostya
> >
> > On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko <
> > konstantin.ponomarenko at gmail.com> wrote:
> >
> >> Hi guys,
> >>
> >> Corosync fails to start if a network interface defined in its
> >> configuration is not configured in the system.
> >> Even with "rrp_mode: passive" the problem is the same when at least one
> >> network interface is not configured in the system.
> >>
> >> Is this the expected behavior?
> >> I thought that when you use redundant rings, it is enough to have at least
> >> one NIC configured in the system. Am I wrong?
> >>
> >> Thank you,
> >> Kostya
> >>
> >
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> >