[Pacemaker] Corosync fails to start when NIC is absent
Kostiantyn Ponomarenko
konstantin.ponomarenko at gmail.com
Mon Jan 19 14:57:53 UTC 2015
One more thing to clarify.
You said "rebind can be avoided" - what does that mean?
Thank you,
Kostya
On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko <
konstantin.ponomarenko at gmail.com> wrote:
> Thank you. Now I am aware of it.
>
> Thank you,
> Kostya
>
> On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse at redhat.com> wrote:
>
>> Kostiantyn,
>>
>> > Honza,
>> >
>> > Thank you for helping me.
>> > So, there is no defined behavior in case one of the interfaces is not in
>> > the system?
>>
>> You are right. There is no defined behavior.
>>
>> Regards,
>> Honza
>>
>>
>> >
>> >
>> > Thank you,
>> > Kostya
>> >
>> > On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse at redhat.com> wrote:
>> >
>> >> Kostiantyn,
>> >>
>> >>
>> >>> According to https://access.redhat.com/solutions/638843 , the
>> >>> interface that is defined in corosync.conf must be present in the
>> >>> system (see the "ROOT CAUSE" section at the bottom of the article).
>> >>> To confirm that, I ran a couple of tests.
>> >>>
>> >>> Here is the relevant part of the corosync.conf file (in free form; the
>> >>> original config file is also attached):
>> >>> ===============================
>> >>> rrp_mode: passive
>> >>> ring0_addr is defined in corosync.conf
>> >>> ring1_addr is defined in corosync.conf
>> >>> ===============================
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Two-node cluster
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Test #1:
>> >>> --------------------------------------------------
>> >>> IP for ring0 is not defined in the system:
>> >>> --------------------------------------------------
>> >>> Start Corosync simultaneously on both nodes.
>> >>> Corosync fails to start.
>> >>> From the logs:
>> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in config: No interfaces defined
>> >>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster Engine exiting with status 8 at main.c:1343.
>> >>> Result: Corosync and Pacemaker are not running.
>> >>>
>> >>> Test #2:
>> >>> --------------------------------------------------
>> >>> IP for ring1 is not defined in the system:
>> >>> --------------------------------------------------
>> >>> Start Corosync simultaneously on both nodes.
>> >>> Corosync starts.
>> >>> Start Pacemaker simultaneously on both nodes.
>> >>> Pacemaker fails to start.
>> >>> From the logs, the last entries from corosync:
>> >>> Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0 interface 169.254.1.3 FAULTY
>> >>> Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically recovered ring 0
>> >>> Result: Corosync and Pacemaker are not running.
>> >>>
>> >>>
>> >>> Test #3:
>> >>>
>> >>> "rrp_mode: active" leads to the same result, except that the Corosync and
>> >>> Pacemaker init scripts return status "running".
>> >>> But "vim /var/log/cluster/corosync.log" still shows a lot of errors like:
>> >>> Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
>> >>>
>> >>> Result: Corosync and Pacemaker report their status as "running", but
>> >>> "crm_mon" cannot connect to the cluster database, and half of
>> >>> Pacemaker's services are not running (including the Cluster Information
>> >>> Base (CIB)).
>> >>>
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Single-node mode
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> IP for ring0 is not defined in the system:
>> >>>
>> >>> Corosync fails to start.
>> >>>
>> >>> IP for ring1 is not defined in the system:
>> >>>
>> >>> Corosync and Pacemaker are started.
>> >>>
>> >>> It is possible that the configuration will be applied successfully (50%),
>> >>>
>> >>> and it is possible that the cluster is not running any resources,
>> >>>
>> >>> and it is possible that the node cannot be put into standby mode (shows:
>> >>> communication error),
>> >>>
>> >>> and it is possible that the cluster is running all resources, but the
>> >>> applied configuration is not guaranteed to be fully loaded (some rules
>> >>> can be missing).
>> >>>
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> Conclusions:
>> >>>
>> >>> -------------------------------
>> >>>
>> >>> It is possible that in some rare cases (see the comments on the bug) the
>> >>> cluster will work, but in that case its working state is unstable and
>> >>> the cluster can stop working at any moment.
>> >>>
>> >>>
>> >>> So, is this correct? Do my assumptions make any sense? I didn't find any
>> >>> other explanation on the net ... .
>> >>
>> >> Corosync needs all interfaces during start and at runtime. This doesn't
>> >> mean they must be connected (that would make corosync unusable in case of
>> >> a physical NIC/switch or cable failure), but they must be up and have the
>> >> correct IP.
>> >>
>> >> When this is not the case, corosync rebinds to localhost and weird
>> >> things happen. Removal of this rebinding is a long-standing TODO, but there
>> >> are still more important bugs (especially because rebind can be avoided).
>> >>
>> >> Regards,
>> >> Honza
>> >>
>> >>>
>> >>>
>> >>>
>> >>> Thank you,
>> >>> Kostya
>> >>>
>> >>> On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko <
>> >>> konstantin.ponomarenko at gmail.com> wrote:
>> >>>
>> >>>> Hi guys,
>> >>>>
>> >>>> Corosync fails to start if a network interface it is set up to use is
>> >>>> not configured in the system.
>> >>>> Even with "rrp_mode: passive" the problem is the same when at least one
>> >>>> network interface is not configured in the system.
>> >>>>
>> >>>> Is this the expected behavior?
>> >>>> I thought that when you use redundant rings, it is enough to have at
>> >>>> least one NIC configured in the system. Am I wrong?
>> >>>>
>> >>>> Thank you,
>> >>>> Kostya
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>>
>> >>> Project Home: http://www.clusterlabs.org
>> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >>> Bugs: http://bugs.clusterlabs.org
>> >>>
>> >>
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>>
>>
>
>
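Honza's requirement in the quoted reply (every ring interface must be up and
have the correct IP before corosync starts) can be checked with a small
pre-start script. What follows is a minimal Python sketch, not something taken
from corosync or from this thread. The function names can_bind and
missing_ring_addresses are made up for illustration. The check tries to bind a
UDP socket to each address, which normally succeeds only when the address is
assigned to a local interface. It assumes Linux defaults (with
net.ipv4.ip_nonlocal_bind enabled, the bind succeeds for any address), and it
does not verify that the link itself is up.

```python
import socket


def can_bind(address):
    """Return True if a UDP socket can be bound to `address` on this host.

    Assumption: a successful bind means the address is assigned to a local
    interface. This does not hold when net.ipv4.ip_nonlocal_bind is enabled.
    """
    family = socket.AF_INET6 if ":" in address else socket.AF_INET
    try:
        with socket.socket(family, socket.SOCK_DGRAM) as sock:
            sock.bind((address, 0))  # port 0: let the kernel pick a free port
        return True
    except OSError:
        # Typically EADDRNOTAVAIL when no local interface has this address.
        return False


def missing_ring_addresses(addresses, is_local=can_bind):
    """Return the entries of `addresses` that are not usable on this host.

    `addresses` is this node's own ring0/ring1 addresses (hypothetical input,
    supplied by the caller rather than parsed from corosync.conf).
    `is_local` can be replaced for testing.
    """
    return [addr for addr in addresses if not is_local(addr)]
```

A wrapper around the init script could call missing_ring_addresses with this
node's ring addresses (for example 169.254.1.3, the ring 0 address from the
log above) and refuse to start corosync while the returned list is not empty.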