[ClusterLabs] Is corosync supposed to be restarted if it dies?

Jan Pokorný jpokorny at redhat.com
Wed Nov 29 16:00:36 EST 2017


On 28/11/17 22:35 +0300, Andrei Borzenkov wrote:
> 28.11.2017 13:01, Jan Pokorný wrote:
>> On 27/11/17 17:43 +0300, Andrei Borzenkov wrote:
>>> Sent from iPhone
>>> 
>>>> On 27 Nov 2017, at 14:36, Ferenc Wágner <wferi at niif.hu> wrote:
>>>> 
>>>> Andrei Borzenkov <arvidjaar at gmail.com> writes:
>>>> 
>>>>> 25.11.2017 10:05, Andrei Borzenkov wrote:
>>>>> 
>>>>>> In one of the guides, the suggested procedure to simulate split brain
>>>>>> was to kill the corosync process. It actually worked on one cluster,
>>>>>> but on another the corosync process was restarted after being killed
>>>>>> without the cluster noticing anything. Except that after several
>>>>>> attempts pacemaker died with stopping resources ... :)
>>>>>> 
>>>>>> This is SLES12 SP2; I do not see any Restart in the service definition,
>>>>>> so it is probably not systemd.
>>>>>> 
>>>>> FTR - it was not corosync, but pacemaker; its unit file specifies
>>>>> Restart=on-failure, so killing corosync caused pacemaker to fail and
>>>>> be restarted by systemd.
>>>> 
>>>> And starting corosync via a Requires dependency?
>>> 
>>> Exactly.
>> 
>> From my testing it looks like we should change
>> "Requires=corosync.service" to "BindsTo=corosync.service"
>> in pacemaker.service.
>> 
>> Could you give it a try?
>> 
> 
> I'm not sure what the expected outcome is, but pacemaker.service is still
> restarted (due to Restart=on-failure).

The expected outcome is that pacemaker.service will become
"inactive (dead)" after killing corosync (as a result of pacemaker
being "bound" to corosync).  Have you indeed issued "systemctl
daemon-reload" after updating the pacemaker unit file?

(FTR, I tried with systemd 235).
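
For reference, a minimal sketch of that change done as a drop-in
rather than as an edit of the packaged unit file.  The drop-in
directory and file name below are assumptions for illustration, not
something taken from this thread:

```ini
# /etc/systemd/system/pacemaker.service.d/bindsto-corosync.conf
# (hypothetical file name; any *.conf file in this directory is read)
[Unit]
# Stop pacemaker whenever corosync.service goes away.  This is added
# on top of the Requires=corosync.service already in the unit.
BindsTo=corosync.service
```

After "systemctl daemon-reload", the output of
"systemctl show -p BindsTo pacemaker.service" should list
corosync.service if the drop-in was picked up.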

> If the intention is to unconditionally stop it when corosync dies,
> pacemaker should probably exit with a unique code and the unit file
> should have RestartPreventExitStatus set to it.

That would be an elaborate way to reach the same result.
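
For comparison, that alternative would look roughly like the sketch
below.  The exit code 100 is purely hypothetical; pacemaker would
first have to be taught to exit with a dedicated code when it loses
corosync:

```ini
[Service]
Restart=on-failure
# Hypothetical exit code meaning "corosync is gone, do not restart me";
# systemd skips the automatic restart when the main process exits
# with a status listed here.
RestartPreventExitStatus=100
```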

But it is a good point to question what the "best intention" is in
these scenarios.  Normally, fencing would happen, but as you note, the
node actually survived by being fast enough to bring corosync back to
life.  From there, the question is whether it adds any value to have
pacemaker restarted on non-clean terminations at all.  I don't know.

Would it make more sense to use FailureAction=reboot-immediate to
at least partly emulate the fencing instead?
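
A rough sketch of what I mean, untested.  Which section the setting
belongs in is an assumption on my side: systemd versions differ on
whether FailureAction= is documented under [Service] or [Unit], so
check systemd.service(5) and systemd.unit(5) for the version at hand.
How it interacts with Restart=on-failure would also need testing:

```ini
[Service]
# Reboot the node immediately, without a clean shutdown, once the
# unit enters the failed state (risk of data loss, much like fencing).
FailureAction=reboot-immediate
```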

-- 
Jan (Poki)