[Pacemaker] hangs pending

Andrey Groshev greenx at yandex.ru
Tue Mar 4 06:29:15 CET 2014


Good morning.
I got confused during the night, so I am attaching another test log.

The point of this test is to shut the cluster down cleanly.
That is, put the nodes into standby in PCMK one after another
and stop the services (pacemaker and corosync).
Then do everything in reverse order:
start the services in sequence and take the nodes out of standby.
The second node hangs in the "pending" state.
Most worryingly, the following nodes do not start their services even though their status is "online".
Morning logs - http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
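
The sequence described above can be sketched as a small Python helper. It only builds the list of commands and runs nothing. The node names, the per-node ordering, reversing the node order on startup, and the exact `crm_standby` and `service` invocations are assumptions for illustration, not taken from the test logs.

```python
# Sketch only: builds the command sequence for a clean cluster stop/start.
# Node names and command lines are assumptions, not taken from the logs.

def shutdown_plan(nodes):
    """For each node in order: standby first, then stop pacemaker and corosync."""
    plan = []
    for node in nodes:
        plan.append((node, f"crm_standby -N {node} -v on"))
        plan.append((node, "service pacemaker stop"))
        plan.append((node, "service corosync stop"))
    return plan


def startup_plan(nodes):
    """Reverse of shutdown_plan: start services, then take the node out of standby."""
    plan = []
    for node in reversed(nodes):
        plan.append((node, "service corosync start"))
        plan.append((node, "service pacemaker start"))
        plan.append((node, f"crm_standby -N {node} -D"))
    return plan


if __name__ == "__main__":
    nodes = ["node1", "node2", "node3"]  # hypothetical node names
    for node, command in shutdown_plan(nodes) + startup_plan(nodes):
        print(f"{node}: {command}")
```

Printing the plan first makes it easy to compare the intended order with what the logs show for the node that stays in "pending".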

04.03.2014, 02:13, "Andrey Groshev" <greenx at yandex.ru>:
> Hi!
> I thought that all the bugs had already been caught. :)
> But today (tonight, actually) I built the latest PCMK from git with upstart support added.
> And again I caught the hang in "pending".
> logs http://send2me.ru/pcmk-04-Mar-2014.tar.bz2
>
> 24.02.2014, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
>
>>  On 22 Feb 2014, at 7:07 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>   21.02.2014, 04:00, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>   On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>    20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>    On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>     20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>     On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>      19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>      On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>       Hi, ALL and Andrew!
>>>>>>>>>>>
>>>>>>>>>>>       Today is a good day - I killed a lot, and got shot at a lot.
>>>>>>>>>>>       In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>       Apart from the resources, eight processes on the node are important to me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>       I killed them with different signals (4,6,11 and even 9).
>>>>>>>>>>>       The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>       If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's good too.
>>>>>>>>>>>       But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>
>>>>>>>>>>>       It turned out there are four groups:
>>>>>>>>>>>       1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>       Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>      excellent
>>>>>>>>>>>       3. stonithd,attrd,pengine - STONITH is not needed.
>>>>>>>>>>>       These daemons simply restart; the resources stay running.
>>>>>>>>>>      right
>>>>>>>>>>>       2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>>       Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>       Sometimes the daemon restarts
>>>>>>>>>>      The daemon will always try to restart; the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>>>>>      If the failure happens just before they're due to receive the totem token, the failure will be very quickly detected and the node fenced.
>>>>>>>>>>      If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>>>>>>
>>>>>>>>>>      So fence/not fence is normal and to be expected.
>>>>>>>>>>>       and the resources restart with a long delay (MS:pgsql).
>>>>>>>>>>>       Once, after a crmd restart, pgsql did not restart.
>>>>>>>>>>      I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>>>>>      In the case where the node is not fenced, the crmd will respawn and the PE will request that it re-detect the state of all resources.
>>>>>>>>>>
>>>>>>>>>>      If the agent reports "all good", then there is nothing more to do.
>>>>>>>>>>      If the agent is not reporting "all good", you should really be asking why.
>>>>>>>>>>>       4. pacemakerd - nothing happens.
>>>>>>>>>>      On non-systemd based machines, correct.
>>>>>>>>>>
>>>>>>>>>>      On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>>>>>      Any subsequent daemon failure will be detected and the daemon respawned.
>>>>>>>>>      And! I almost forgot about IT!
>>>>>>>>>      Exist another (NORMAL) the variants, the methods, the ideas?
>>>>>>>>>      Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>>>>>      Otherwise - it's a full epic fail ;)
>>>>>>>>     -ENOPARSE
>>>>>>>     OK, I'll leave out my personal attitude toward "systemd".
>>>>>>>     Let me explain.
>>>>>>>
>>>>>>>     Somewhere at the beginning of this thread, I wrote:
>>>>>>>     A.G.:Who knows who runs lrmd?
>>>>>>>     A.B.:Pacemakerd.
>>>>>>>     That's one!
>>>>>>>
>>>>>>>     Let's see the list of processes:
>>>>>>>     #ps -axf
>>>>>>>     .....
>>>>>>>     6067 ?        Ssl    7:24 corosync
>>>>>>>     6092 ?        S      0:25 pacemakerd
>>>>>>>     6094 ?        Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>>>>>>     6095 ?        Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>>>>>>     6096 ?        Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>>>>>>     6097 ?        Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>>>>>>     6098 ?        Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>>>>>>     6099 ?        Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>>>>>>>     .....
>>>>>>>     That's two!
>>>>>>    What's two?  I don't follow.
>>>>>    In the sense that it creates other processes. But it does not matter.
>>>>>>>     And so on, and so on...
>>>>>>>     Now you must understand why I want this process to always be running.
>>>>>>>     I don't think it even needs explaining to anyone here!
>>>>>>>
>>>>>>>     And now you tell me "pacemakerd works nicely, but only on systemd distros"!!!
>>>>>>    No, I'm saying it works _better_ on systemd distros.
>>>>>>    On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and ran "service pacemaker restart" prior to the additional failures).
>>>>>    Can you show me the place where:
>>>>>    "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?
>>>>   The code for it is in mcp/pacemaker.c, look for find_and_track_existing_processes()
>>>>
>>>>   The ps tree will look different though
>>>>
>>>>    6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>>>>    6095 ?        Ss     0:25  /usr/libexec/pacemaker/stonithd
>>>>    6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>>>>    6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>>>>    6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>>>>    6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
>>>>   ...
>>>>    6666 ?        S      0:25 pacemakerd
>>>>
>>>>   but pacemakerd will be watching the old children and respawning them on failure.
>>>>   at which point you might see:
>>>>
>>>>    6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>>>>    6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>>>>    6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>>>>    6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>>>>    6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
>>>>   ...
>>>>    6666 ?        S      0:25 pacemakerd
>>>>    6667 ?        Ss     0:25 \_ /usr/libexec/pacemaker/stonithd
>>>>>    If I respawn the pacemakerd process via upstart, will it also "reattach to the existing daemons"?
>>>>   If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.
>>>   Some people defend you and send me hate mail when I'm not restrained.
>>  You should see the mail I get off-list ;-)
>>>   But You're also a beetle :)
>>  I'm not 100% sure what you mean there.
>>>   Why didn't you say anything about upstart support in the spec?
>>  Mostly because I don't run it anywhere, so I have no idea what it does by default or can be configured to do.
>>  It's not malicious, the feature was simply written and tested in the context of systemd.
>>
>>  Also, when I think upstart, I think Debian-based distros, which don't use spec files ;-)
>>
>>  _______________________________________________
>>  Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org


