[Pacemaker] Corosync fails to start when NIC is absent

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Wed Jan 14 12:31:31 CET 2015


Thank you. Now I am aware of it.

Thank you,
Kostya

On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse at redhat.com> wrote:

> Kostiantyn,
>
> > Honza,
> >
> > Thank you for helping me.
> > So, there is no defined behavior in case one of the interfaces is not in
> > the system?
>
> You are right. There is no defined behavior.
>
> Regards,
>   Honza
>
>
> >
> >
> > Thank you,
> > Kostya
> >
> > On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse at redhat.com> wrote:
> >
> >> Kostiantyn,
> >>
> >>
> >>> According to https://access.redhat.com/solutions/638843 , an interface
> >>> that is defined in corosync.conf must be present in the system (see the
> >>> "ROOT CAUSE" section at the bottom of the article).
> >>> To confirm that, I ran a couple of tests.
> >>>
> >>> Here is the relevant part of the corosync.conf file (paraphrased; the
> >>> original config file is also attached):
> >>> ===============================
> >>> rrp_mode: passive
> >>> ring0_addr is defined in corosync.conf
> >>> ring1_addr is defined in corosync.conf
> >>> ===============================
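For readers without the attachment, here is a minimal sketch of what a two-ring setup like the one paraphrased above could look like, plus a small Python helper that extracts the ring addresses from it. The addresses (documentation-range IPs), the node IDs and the helper name `ring_addresses` are illustrative assumptions, not values from the attached config, and the layout assumes a corosync 2.x style `nodelist`.

```python
import re

# Hypothetical corosync.conf excerpt: two nodes, two rings, passive RRP.
SAMPLE_CONF = """\
totem {
    version: 2
    rrp_mode: passive
}

nodelist {
    node {
        nodeid: 1
        ring0_addr: 192.0.2.11
        ring1_addr: 198.51.100.11
    }
    node {
        nodeid: 2
        ring0_addr: 192.0.2.12
        ring1_addr: 198.51.100.12
    }
}
"""

# Matches lines such as "ring0_addr: 192.0.2.11".
RING_ADDR = re.compile(r"^\s*ring(\d+)_addr:\s*(\S+)", re.MULTILINE)


def ring_addresses(conf_text):
    """Map each ring number to the list of addresses configured for it."""
    rings = {}
    for ring, addr in RING_ADDR.findall(conf_text):
        rings.setdefault(int(ring), []).append(addr)
    return rings


if __name__ == "__main__":
    for ring, addrs in sorted(ring_addresses(SAMPLE_CONF).items()):
        print(f"ring{ring}: {', '.join(addrs)}")
```

The helper only reads `ringN_addr` entries; it does not validate the rest of the file or decide which node is the local one.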
> >>>
> >>> -------------------------------
> >>>
> >>> Two-node cluster
> >>>
> >>> -------------------------------
> >>>
> >>> Test #1:
> >>> --------------------------------------------------
> >>> IP for ring0 is not defined in the system:
> >>> --------------------------------------------------
> >>> Start Corosync simultaneously on both nodes.
> >>> Corosync fails to start.
> >>> From the logs:
> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
> >>> config: No interfaces defined
> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
> >>> Engine exiting with status 8 at main.c:1343.
> >>> Result: Corosync and Pacemaker are not running.
> >>>
> >>> Test #2:
> >>> --------------------------------------------------
> >>> IP for ring1 is not defined in the system:
> >>> --------------------------------------------------
> >>> Start Corosync simultaneously on both nodes.
> >>> Corosync starts.
> >>> Start Pacemaker simultaneously on both nodes.
> >>> Pacemaker fails to start.
> >>> From the logs, the last entries written by "corosync":
> >>> Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0
> >>> interface 169.254.1.3 FAULTY
> >>> Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically
> >>> recovered ring 0
> >>> Result: Corosync and Pacemaker are not running.
> >>>
> >>>
> >>> Test #3:
> >>>
> >>> "rrp_mode: active" leads to the same result, except that the Corosync and
> >>> Pacemaker init scripts return status "running".
> >>> But "vim /var/log/cluster/corosync.log" still shows a lot of errors like:
> >>> Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
> >>> to the CPG API failed: Library error (2)
> >>>
> >>> Result: Corosync and Pacemaker report their status as "running", but
> >>> "crm_mon" cannot connect to the cluster database, and half of Pacemaker's
> >>> services are not running (including the Cluster Information Base (CIB)).
> >>>
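Since the init scripts can report "running" while the cluster is broken, it may help to look for the symptoms in the log rather than trust the status. Below is a rough sketch that counts the three messages quoted in the tests above; the function name `scan_log`, the symptom labels and the regular expressions are my own assumptions, written only against the log lines shown in this thread.

```python
import re

# Patterns derived from the log excerpts quoted in this thread.
SYMPTOMS = {
    "no_interfaces": re.compile(r"No interfaces defined"),
    "ring_faulty": re.compile(
        r"\[TOTEM\s*\] Marking ringid (\d+) interface (\S+) FAULTY"
    ),
    "cpg_failed": re.compile(r"Connection to the CPG API failed"),
}


def scan_log(lines):
    """Count how many lines match each known symptom."""
    counts = {name: 0 for name in SYMPTOMS}
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    # Log lines taken from the tests above, re-joined onto single lines.
    sample = [
        "Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] "
        "Marking ringid 0 interface 169.254.1.3 FAULTY",
        "Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] "
        "Automatically recovered ring 0",
        "Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: "
        "Connection to the CPG API failed: Library error (2)",
    ]
    print(scan_log(sample))
```

To scan a real log, pass an open file object (for example `/var/log/cluster/corosync.log`) as `lines`. A non-zero count is only a hint that the node may be in the state described here, not proof.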
> >>>
> >>> -------------------------------
> >>>
> >>> Single-node mode
> >>>
> >>> -------------------------------
> >>>
> >>> IP for ring0 is not defined in the system:
> >>>
> >>> Corosync fails to start.
> >>>
> >>> IP for ring1 is not defined in the system:
> >>>
> >>> Corosync and Pacemaker are started.
> >>>
> >>> It is possible that the configuration will be applied successfully (50%),
> >>>
> >>> and it is possible that the cluster is not running any resources,
> >>>
> >>> and it is possible that the node cannot be put into standby mode (shows:
> >>> communication error),
> >>>
> >>> and it is possible that the cluster is running all resources, but the
> >>> applied configuration is not guaranteed to be fully loaded (some rules
> >>> can be missing).
> >>>
> >>>
> >>> -------------------------------
> >>>
> >>> Conclusions:
> >>>
> >>> -------------------------------
> >>>
> >>> It is possible that in some rare cases (see the comments on the bug) the
> >>> cluster will work, but in that case its working state is unstable and the
> >>> cluster can stop working at any moment.
> >>>
> >>>
> >>> So, is this correct? Do my assumptions make sense? I didn't find any other
> >>> explanation on the net.
> >>
> >> Corosync needs all interfaces during start and runtime. This doesn't
> >> mean they must be connected (that would make corosync unusable in case
> >> of a physical NIC/switch or cable failure), but they must be up and have
> >> the correct IP.
> >>
> >> When this is not the case, corosync rebinds to localhost and weird
> >> things happen. Removal of this rebinding is a long-standing TODO, but
> >> there are still more important bugs (especially because the rebind can
> >> be avoided).
> >>
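Given that, one way to avoid the rebind seems to be checking before starting corosync that every local ring address is actually assigned on the host. Here is a minimal sketch, assuming the local addresses are already known; the function names and the sample addresses are hypothetical. It tries to bind a UDP socket to each address, which normally fails when the address is not configured locally. It does not check that the interface is up, and it can give false positives if `net.ipv4.ip_nonlocal_bind` is enabled.

```python
import socket


def is_local_address(addr):
    """Return True if a UDP socket can be bound to addr on this host."""
    family = socket.AF_INET6 if ":" in addr else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((addr, 0))  # port 0: let the kernel pick a free port
        except OSError:
            # Typically EADDRNOTAVAIL when no interface carries this address.
            return False
    return True


def missing_ring_addresses(ring_addrs, probe=is_local_address):
    """Return the ring names whose address fails the probe, sorted by name."""
    return [ring for ring, addr in sorted(ring_addrs.items()) if not probe(addr)]


if __name__ == "__main__":
    # Hypothetical local ring addresses for this node (documentation ranges).
    local_rings = {"ring0": "192.0.2.11", "ring1": "198.51.100.11"}
    missing = missing_ring_addresses(local_rings)
    if missing:
        print("Not starting corosync, address missing for:", ", ".join(missing))
    else:
        print("All ring addresses are present.")
```

The `probe` parameter exists only so the logic can be exercised without touching the network; in an init script wrapper the default would be used and a non-empty result would be a reason to skip starting corosync.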
> >> Regards,
> >>   Honza
> >>
> >>>
> >>>
> >>>
> >>> Thank you,
> >>> Kostya
> >>>
> >>> On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko <
> >>> konstantin.ponomarenko at gmail.com> wrote:
> >>>
> >>>> Hi guys,
> >>>>
> >>>> Corosync fails to start if a network interface it is set up to use is
> >>>> not configured in the system.
> >>>> Even with "rrp_mode: passive" the problem is the same when at least one
> >>>> network interface is not configured in the system.
> >>>>
> >>>> Is this the expected behavior?
> >>>> I thought that when you use redundant rings, it is enough to have at
> >>>> least one NIC configured in the system. Am I wrong?
> >>>>
> >>>> Thank you,
> >>>> Kostya
> >>>>
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>>