[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
Bogdan Dobrelya
bdobrelia at mirantis.com
Fri Feb 5 13:11:27 CET 2016
On 04.02.2016 15:43, Bogdan Dobrelya wrote:
> Hello.
> Regarding the original issue, the good news is that the resource-agents
> ocf-shellfuncs no longer causes fork bombs in the dummy OCF RA [0]
> now that the fix [1] is in. The bad news is that the "self-forking"
> monitors issue seems to remain for the rabbitmq OCF RA [2], and I can
> reproduce it with another custom agent [3], so I'd guess it may be
> valid for other agents as well.
>
> IIUC, the issue seems related to how lrmd forks monitor actions.
> I tried to debug both pacemaker 1.1.10 and 1.1.12 with gdb as follows:
>
> # cat ./cmds
> set follow-fork-mode child
> set detach-on-fork off
> set follow-exec-mode new
> catch fork
> catch vfork
> cont
> # gdb -x cmds /usr/lib/pacemaker/lrmd `pgrep lrmd`
>
> I can confirm it catches forked monitors, and that they make nested
> forks as well. But I have *many* debug symbols missing, bt is full of
> question marks and, honestly, I'm not a gdb guru and do not know what
> to check for in the reproduced cases.
>
> So any help with how to troubleshoot things further is very much appreciated!
I figured out that this is expected behaviour. There are no fork bombs
left, only the usual fork & exec syscalls each time the OCF RA calls a
shell command or the ocf_run and ocf_log functions. Those false
"self-forks" are nothing more than a transient state between the fork
and exec calls, when the command line shown for the child process has
yet to be updated. So I believe the aforementioned patch solved the
problem completely.
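
To illustrate the transient state described above, here is a minimal
sketch (not taken from any of the agents, and assuming a Linux /proc
filesystem). A subshell is a forked copy of the script that never execs,
so its command line stays identical to the parent's, which is what a
"self-forked" monitor looks like in ps. The helper name cmdline_of is
made up for this example.

```shell
#!/bin/sh
# Sketch only: assumes Linux, where /proc/<pid>/cmdline holds the
# NUL-separated argv of a process. cmdline_of is a hypothetical helper.
cmdline_of() {
    tr '\0' ' ' < "/proc/$1/cmdline"
}

# A subshell is created by fork() without exec(). The trailing ':' keeps
# the shell from exec'ing sleep in place of the subshell itself.
( sleep 1; : ) &
child=$!

parent_cmd=$(cmdline_of "$$")
child_cmd=$(cmdline_of "$child")

echo "parent $$: $parent_cmd"
echo "child  $child: $child_cmd"

if [ "$parent_cmd" = "$child_cmd" ]; then
    echo "child shows the parent's command line (forked, no exec yet)"
fi

# Reap the background subshell before exiting.
wait "$child"
```

For a real RA the window between fork and exec is much shorter than in
this sketch, so such entries would be expected to show up only
occasionally in a ps snapshot and to disappear on the next one.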
>
> [0] https://github.com/bogdando/dummy-ocf-ra
> [1] https://github.com/ClusterLabs/resource-agents/issues/734
> [2]
> https://github.com/rabbitmq/rabbitmq-server/blob/master/scripts/rabbitmq-server-ha.ocf
> [3]
> https://git.openstack.org/cgit/openstack/fuel-library/tree/files/fuel-ha-utils/ocf/ns_vrouter
>
> On 04.01.2016 17:33, Bogdan Dobrelya wrote:
>> On 04.01.2016 17:14, Dejan Muhamedagic wrote:
>>> Hi,
>>>
>>> On Mon, Jan 04, 2016 at 04:52:43PM +0100, Bogdan Dobrelya wrote:
>>>> On 04.01.2016 16:36, Ken Gaillot wrote:
>>>>> On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
>>>>>> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>>> [...]
>>>>>> Also note that lrmd spawns *many* monitors like:
>>>>>> root 6495 0.0 0.0 70268 1456 ? Ss 2015 4:56 \_
>>>>>> /usr/lib/pacemaker/lrmd
>>>>>> root 31815 0.0 0.0 4440 780 ? S 15:08 0:00 | \_
>>>>>> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root 31908 0.0 0.0 4440 388 ? S 15:08 0:00 |
>>>>>> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root 31910 0.0 0.0 4440 384 ? S 15:08 0:00 |
>>>>>> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> root 31915 0.0 0.0 4440 392 ? S 15:08 0:00 |
>>>>>> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
>>>>>> ...
>>>>>
>>>>> At first glance, that looks like your monitor action is calling itself
>>>>> recursively, but I don't see how in your code.
>>>>
>>>> Yes, it should be a bug in ocf-shellfuncs' ocf_log().
>>>
>>> If you're sure about that, please open an issue at
>>> https://github.com/ClusterLabs/resource-agents/issues
>>
>> Submitted [0]. Thank you!
>> Note that the import itself seems to cause the issue, not the
>> ocf_run or ocf_log code.
>>
>> [0] https://github.com/ClusterLabs/resource-agents/issues/734
>>
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
--
Best regards,
Bogdan Dobrelya,
Irc #bogdando