[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Mon Mar 3 22:08:17 UTC 2014
Hi!
I thought that all the bugs had already been caught. :)
But today (tonight, already) I built the latest PCMK from git with upstart added.
And again I hit the pending hang.
Logs: http://send2me.ru/pcmk-04-Mar-2014.tar.bz2
24.02.2014, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
> On 22 Feb 2014, at 7:07 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>> 21.02.2014, 04:00, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>> 20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>> On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>> 20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>> On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>> 19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>> On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>
>>>>>>>>>> Today is a good day - I killed a lot, and got shot at a lot.
>>>>>>>>>> In general - I am happy (almost like an elephant) :)
>>>>>>>>>> Apart from the resources, eight processes on the node are important to me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>> I killed them with different signals (4,6,11 and even 9).
>>>>>>>>>> The behavior does not depend on the signal number - that's good.
>>>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's good too.
>>>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>>>
>>>>>>>>>> It turned out there are four groups:
>>>>>>>>>> 1. corosync,cib - STONITH works 100%.
>>>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>>> excellent
>>>>>>>>>> 3. stonithd,attrd,pengine - no STONITH needed.
>>>>>>>>>> These daemons simply restart; the resources stay running.
>>>>>>>>> right
>>>>>>>>>> 2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>> Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>> Sometimes the daemon restarts
>>>>>>>>> The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>>>> If the failure happens just before they're due to receive the totem token, the failure will be very quickly detected and the node fenced.
>>>>>>>>> If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>>>>>
>>>>>>>>> So fence/not fence is normal and to be expected.
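
The timing argument above can be illustrated with a toy model. This is only a sketch under strong assumptions: the peer is assumed to notice the failure exactly at the next token boundary, and the node is assumed to be fenced only if the daemon has not finished respawning by then. The names time_until_detection, is_fenced, token_interval and respawn_time are hypothetical and are not Pacemaker or Corosync settings.

```python
def time_until_detection(failure_offset, token_interval):
    """Seconds between the failure and the next token boundary.

    Toy assumption: the peer notices the failure exactly when the next
    token is due. failure_offset is how far into the current token
    interval the daemon died.
    """
    if not 0 <= failure_offset < token_interval:
        raise ValueError("failure_offset must lie inside one token interval")
    return token_interval - failure_offset


def is_fenced(failure_offset, token_interval, respawn_time):
    """Return True if the peer notices before the daemon has respawned.

    Toy assumption: a daemon that is back before detection is never fenced.
    """
    return respawn_time > time_until_detection(failure_offset, token_interval)


if __name__ == "__main__":
    # Same respawn time, different moments of failure within the interval.
    for offset in (0.1, 0.5, 0.9):
        print(offset, is_fenced(offset, token_interval=1.0, respawn_time=0.5))
```

With the same respawn time, a failure late in the interval is detected first and the node is fenced, while a failure early in the interval leaves enough time to recover, which matches the "fence/not fence is normal" conclusion.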
>>>>>>>>>> and the MS:pgsql resources restart with a large delay.
>>>>>>>>>> One time, after crmd restarted, pgsql did not restart.
>>>>>>>>> I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>>>> In the case the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>>>>>>
>>>>>>>>> If the agent reports "all good", then there is nothing more to do.
>>>>>>>>> If the agent is not reporting "all good", you should really be asking why.
>>>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>>> On non-systemd based machines, correct.
>>>>>>>>>
>>>>>>>>> On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>>>> Any subsequent daemon failure will be detected and the daemon respawned.
>>>>>>>> And! I almost forgot about IT!
>>>>>>>> Exist another (NORMAL) the variants, the methods, the ideas?
>>>>>>>> Without this ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>>>> Otherwise - it's a full epic fail ;)
>>>>>>> -ENOPARSE
>>>>>> OK, I remove my personal attitude to "systemd".
>>>>>> Let me explain.
>>>>>>
>>>>>> Somewhere in the beginning of this topic, I wrote:
>>>>>> A.G.:Who knows who runs lrmd?
>>>>>> A.B.:Pacemakerd.
>>>>>> That's one!
>>>>>>
>>>>>> Let's see the list of processes:
>>>>>> #ps -axf
>>>>>> .....
>>>>>> 6067 ? Ssl 7:24 corosync
>>>>>> 6092 ? S 0:25 pacemakerd
>>>>>> 6094 ? Ss 116:13 \_ /usr/libexec/pacemaker/cib
>>>>>> 6095 ? Ss 0:25 \_ /usr/libexec/pacemaker/stonithd
>>>>>> 6096 ? Ss 1:27 \_ /usr/libexec/pacemaker/lrmd
>>>>>> 6097 ? Ss 0:49 \_ /usr/libexec/pacemaker/attrd
>>>>>> 6098 ? Ss 0:25 \_ /usr/libexec/pacemaker/pengine
>>>>>> 6099 ? Ss 0:29 \_ /usr/libexec/pacemaker/crmd
>>>>>> .....
>>>>>> That's two!
>>>>> What's two? I don't follow.
>>>> In the sense that it creates other processes. But it does not matter.
>>>>>> And more, more...
>>>>>> Now you must understand why I want this process to always be working.
>>>>>> I even think there is no need to explain that to anyone here!
>>>>>>
>>>>>> And now you say "pacemakerd works nicely, but only on systemd distros" !!!
>>>>> No, I'm saying it works _better_ on systemd distros.
>>>>> On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and didn't run "service pacemaker restart" prior to the additional failures).
>>>> Can you show me the place where:
>>>> "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?
>>> The code for it is in mcp/pacemaker.c, look for find_and_track_existing_processes()
>>>
>>> The ps tree will look different though
>>>
>>> 6094 ? Ss 116:13 /usr/libexec/pacemaker/cib
>>> 6095 ? Ss 0:25 /usr/libexec/pacemaker/stonithd
>>> 6096 ? Ss 1:27 /usr/libexec/pacemaker/lrmd
>>> 6097 ? Ss 0:49 /usr/libexec/pacemaker/attrd
>>> 6098 ? Ss 0:25 /usr/libexec/pacemaker/pengine
>>> 6099 ? Ss 0:29 /usr/libexec/pacemaker/crmd
>>> ...
>>> 6666 ? S 0:25 pacemakerd
>>>
>>> but pacemakerd will be watching the old children and respawning them on failure.
>>> at which point you might see:
>>>
>>> 6094 ? Ss 116:13 /usr/libexec/pacemaker/cib
>>> 6096 ? Ss 1:27 /usr/libexec/pacemaker/lrmd
>>> 6097 ? Ss 0:49 /usr/libexec/pacemaker/attrd
>>> 6098 ? Ss 0:25 /usr/libexec/pacemaker/pengine
>>> 6099 ? Ss 0:29 /usr/libexec/pacemaker/crmd
>>> ...
>>> 6666 ? S 0:25 pacemakerd
>>> 6667 ? Ss 0:25 \_ /usr/libexec/pacemaker/stonithd
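
The reattach behavior described above can be sketched roughly as follows. This is not the code from mcp/pacemaker.c; it is a minimal illustration that assumes already-running daemons can be recognized by the process name in a /proc-style tree. The helper names find_existing_daemons and plan_startup and the use of the comm file are assumptions made for this example; the daemon names come from the ps listing in this thread.

```python
import os

# Daemon names taken from the ps listing in this thread.
EXPECTED_DAEMONS = ("cib", "stonithd", "lrmd", "attrd", "pengine", "crmd")


def find_existing_daemons(proc_root="/proc", expected=EXPECTED_DAEMONS):
    """Return {daemon_name: pid} for expected daemons that already run.

    Hypothetical helper: walks the numeric directories under proc_root and
    compares the contents of each <pid>/comm file with the expected names.
    """
    pids = sorted(int(e) for e in os.listdir(proc_root) if e.isdecimal())
    found = {}
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "comm")) as handle:
                name = handle.read().strip()
        except OSError:
            continue  # process exited during the scan, or is unreadable
        if name in expected and name not in found:
            found[name] = pid  # keep the lowest matching PID
    return found


def plan_startup(found, expected=EXPECTED_DAEMONS):
    """Split expected daemons into those to track and those to spawn."""
    track = {name: found[name] for name in expected if name in found}
    spawn = [name for name in expected if name not in found]
    return track, spawn


if __name__ == "__main__":
    track, spawn = plan_startup(find_existing_daemons())
    print("already running, track only:", track)
    print("missing, spawn as children:", spawn)
```

In this sketch a respawned supervisor would only watch the daemons it finds and start the missing ones itself, which is why a daemon restarted later shows up as a child in the ps tree while the older ones do not.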
>>>> If I respawn the pacemakerd process via upstart, will it "reattach to the existing daemons"?
>>> If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.
>> Some people defend you, send me hate mail when I'm not restrained.
>
> You should see the mail I get off-list ;-)
>
>> But You're also a beetle :)
>
> I'm not 100% sure what you mean there.
>
>> Why did you not say anything about supporting upstart in the spec?
>
> Mostly because I don't run it anywhere, so I have no idea what it does by default or can be configured to do.
> It's not malicious, the feature was simply written and tested in the context of systemd.
>
> Also, when I think upstart, I think debian based distros which don't use spec files ;-)
>