[Pacemaker] Corosync fails to start when NIC is absent
Jan Friesse
jfriesse at redhat.com
Tue Jan 20 08:50:43 UTC 2015
Kostiantyn,
> One more thing to clarify.
> You said "rebind can be avoided" - what does it mean?
By that I mean that as long as you don't shut down the interface,
everything will work as expected. Shutting an interface down is an
administrator's decision; the system doesn't do it automagically :)
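
To illustrate the difference (commands are illustrative; "eth1" is a
placeholder interface name): losing link keeps the bind intact, so
corosync just marks the ring faulty, while an administrative shutdown
is what triggers the rebind to localhost:

  # link loss (cable pull, switch failure): the address stays bound
  # and corosync handles it as a ring fault
  ip addr show eth1     # flags show NO-CARRIER, the inet address remains

  # administrative shutdown: this is the case to avoid, since it is
  # what makes corosync rebind to localhost
  ip link set eth1 down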
Regards,
Honza
>
> Thank you,
> Kostya
>
> On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko <
> konstantin.ponomarenko at gmail.com> wrote:
>
>> Thank you. Now I am aware of it.
>>
>> Thank you,
>> Kostya
>>
>> On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse <jfriesse at redhat.com> wrote:
>>
>>> Kostiantyn,
>>>
>>>> Honza,
>>>>
>>>> Thank you for helping me.
>>>> So, there is no defined behavior in case one of the interfaces is not in
>>>> the system?
>>>
>>> You are right. There is no defined behavior.
>>>
>>> Regards,
>>> Honza
>>>
>>>
>>>>
>>>>
>>>> Thank you,
>>>> Kostya
>>>>
>>>> On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse <jfriesse at redhat.com> wrote:
>>>>
>>>>> Kostiantyn,
>>>>>
>>>>>
>>>>>> According to https://access.redhat.com/solutions/638843, the
>>>>>> interface that is defined in corosync.conf must be present in the
>>>>>> system (see the "ROOT CAUSE" section at the bottom of the article).
>>>>>> To confirm that, I ran a couple of tests.
>>>>>>
>>>>>> Here is part of the corosync.conf file (in free-write form; the
>>>>>> original config file is also attached):
>>>>>> ===============================
>>>>>> rrp_mode: passive
>>>>>> ring0_addr is defined in corosync.conf
>>>>>> ring1_addr is defined in corosync.conf
>>>>>> ===============================
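>>>>>>
>>>>>> A minimal sketch of what these stanzas look like in full (structure
>>>>>> follows the free-write description above; the addresses echo the ring0
>>>>>> address visible in the logs below, the rest are placeholders, not the
>>>>>> values from the attached file):
>>>>>>
>>>>>> totem {
>>>>>>     version: 2
>>>>>>     rrp_mode: passive
>>>>>>     transport: udpu
>>>>>> }
>>>>>> nodelist {
>>>>>>     node {
>>>>>>         nodeid: 1
>>>>>>         # both addresses must be present on the node when corosync starts
>>>>>>         ring0_addr: 169.254.1.3
>>>>>>         ring1_addr: 169.254.2.3
>>>>>>     }
>>>>>>     node {
>>>>>>         nodeid: 2
>>>>>>         ring0_addr: 169.254.1.4
>>>>>>         ring1_addr: 169.254.2.4
>>>>>>     }
>>>>>> }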
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> Two-node cluster
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> Test #1:
>>>>>> --------------------------------------------------
>>>>>> IP for ring0 is not defined in the system:
>>>>>> --------------------------------------------------
>>>>>> Start Corosync simultaneously on both nodes.
>>>>>> Corosync fails to start.
>>>>>> From the logs:
>>>>>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
>>>>>> config: No interfaces defined
>>>>>> Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync
>>>>>> Cluster Engine exiting with status 8 at main.c:1343.
>>>>>> Result: Corosync and Pacemaker are not running.
>>>>>>
>>>>>> Test #2:
>>>>>> --------------------------------------------------
>>>>>> IP for ring1 is not defined in the system:
>>>>>> --------------------------------------------------
>>>>>> Start Corosync simultaneously on both nodes.
>>>>>> Corosync starts.
>>>>>> Start Pacemaker simultaneously on both nodes.
>>>>>> Pacemaker fails to start.
>>>>>> From the logs, the last writes from corosync:
>>>>>> Jan 8 16:31:29 daemon.err<27> corosync[3728]: [TOTEM ] Marking ringid 0
>>>>>> interface 169.254.1.3 FAULTY
>>>>>> Jan 8 16:31:30 daemon.notice<29> corosync[3728]: [TOTEM ] Automatically
>>>>>> recovered ring 0
>>>>>> Result: Corosync and Pacemaker are not running.
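>>>>>>
>>>>>> (For reference, the ring state can also be checked with
>>>>>> "corosync-cfgtool -s"; on a healthy node it prints something like the
>>>>>> output below, and a faulty ring is flagged in the status line:
>>>>>>
>>>>>> Printing ring status.
>>>>>> Local node ID 1
>>>>>> RING ID 0
>>>>>>         id      = 169.254.1.3
>>>>>>         status  = ring 0 active with no faults
>>>>>> )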
>>>>>>
>>>>>>
>>>>>> Test #3:
>>>>>>
>>>>>> "rrp_mode: active" leads to the same result, except that the Corosync
>>>>>> and Pacemaker init scripts return the status "running".
>>>>>> But /var/log/cluster/corosync.log still shows a lot of errors like:
>>>>>> Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
>>>>>> Connection to the CPG API failed: Library error (2)
>>>>>>
>>>>>> Result: Corosync and Pacemaker show their statuses as "running", but
>>>>>> "crm_mon" cannot connect to the cluster database, and half of
>>>>>> Pacemaker's services are not running (including the Cluster
>>>>>> Information Base (CIB)).
>>>>>>
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> For a single-node setup
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> IP for ring0 is not defined in the system:
>>>>>>
>>>>>> Corosync fails to start.
>>>>>>
>>>>>> IP for ring1 is not defined in the system:
>>>>>>
>>>>>> Corosync and Pacemaker are started.
>>>>>>
>>>>>> It is possible that the configuration will be applied successfully
>>>>>> (~50% of the time),
>>>>>>
>>>>>> and it is possible that the cluster is not running any resources,
>>>>>>
>>>>>> and it is possible that the node cannot be put into standby mode
>>>>>> (it shows: communication error),
>>>>>>
>>>>>> and it is possible that the cluster is running all resources, but the
>>>>>> applied configuration is not guaranteed to be fully loaded (some rules
>>>>>> can be missing).
>>>>>>
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> Conclusions:
>>>>>>
>>>>>> -------------------------------
>>>>>>
>>>>>> It is possible that in some rare cases (see the comments to the bug)
>>>>>> the cluster will work, but in that case its state is unstable and the
>>>>>> cluster can stop working at any moment.
>>>>>>
>>>>>>
>>>>>> So, is this correct? Do my assumptions make any sense? I didn't find
>>>>>> any other explanation on the net ...
>>>>>
>>>>> Corosync needs all interfaces during start and runtime. This doesn't
>>>>> mean they must be connected (that would make corosync unusable for
>>>>> physical NIC/switch or cable failures), but they must be up and have
>>>>> the correct IP.
>>>>>
>>>>> When this is not the case, corosync rebinds to localhost and weird
>>>>> things happen. Removing this rebinding has long been on the TODO list,
>>>>> but there are still more important bugs (especially because the rebind
>>>>> can be avoided).
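>>>>>
>>>>> A quick sanity check before starting corosync (interface names here
>>>>> are examples) is that every ring interface is UP and carries the
>>>>> address configured in corosync.conf:
>>>>>
>>>>> ip addr show eth0 | grep -E 'state|inet '
>>>>> ip addr show eth1 | grep -E 'state|inet '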
>>>>>
>>>>> Regards,
>>>>> Honza
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you,
>>>>>> Kostya
>>>>>>
>>>>>> On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko <
>>>>>> konstantin.ponomarenko at gmail.com> wrote:
>>>>>>
>>>>>>> Hi guys,
>>>>>>>
>>>>>>> Corosync fails to start if there is no such network interface
>>>>>>> configured in the system.
>>>>>>> Even with "rrp_mode: passive", the problem is the same when at least
>>>>>>> one network interface is not configured in the system.
>>>>>>>
>>>>>>> Is this the expected behavior?
>>>>>>> I thought that when you use redundant rings, it is enough to have at
>>>>>>> least one NIC configured in the system. Am I wrong?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Kostya