[Pacemaker] hangs pending

Mon Mar 3 17:08:17 EST 2014

Hi! 
I thought that all the bugs have already been caught. :)
But today(already tonight) build last git PCMK with add upstart.
And again catch hangs pending.
logs http://send2me.ru/pcmk-04-Mar-2014.tar.bz2

24.02.2014, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
> On 22 Feb 2014, at 7:07 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  21.02.2014, 04:00, "Andrew Beekhof" <andrew at beekhof.net>:
>>>  On 20 Feb 2014, at 10:04 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>   20.02.2014, 13:57, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>   On 20 Feb 2014, at 5:33 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>    20.02.2014, 01:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>    On 20 Feb 2014, at 4:18 am, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>     19.02.2014, 06:47, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>     On 18 Feb 2014, at 9:29 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>      Hi, ALL and Andrew!
>>>>>>>>>>
>>>>>>>>>>      Today is a good day - I killed a lot, and a lot of shooting at me.
>>>>>>>>>>      In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>      Except resources on the node are important to me eight processes: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>      I killed them with different signals (4,6,11 and even 9).
>>>>>>>>>>      Behavior does not depend of number signal - it's good.
>>>>>>>>>>      If STONITH send reboot to the node - it rebooted and rejoined the cluster - too it's good.
>>>>>>>>>>      But the behavior is different from killing various demons.
>>>>>>>>>>
>>>>>>>>>>      Turned four groups:
>>>>>>>>>>      1. corosync,cib - STONITH work 100%.
>>>>>>>>>>      Kill via any signals - call STONITH and reboot.
>>>>>>>>>     excellent
>>>>>>>>>>      3. stonithd,attrd,pengine - not need STONITH
>>>>>>>>>>      This daemons simple restart, resources - stay running.
>>>>>>>>>     right
>>>>>>>>>>      2. lrmd,crmd - strange behavior STONITH.
>>>>>>>>>>      Sometimes called STONITH - and the corresponding reaction.
>>>>>>>>>>      Sometimes restart daemon
>>>>>>>>>     The daemon will always try to restart, the only variable is how long it takes the peer to notice and initiate fencing.
>>>>>>>>>     If the failure happens just before a they're due to receive totem token, the failure will be very quickly detected and the node fenced.
>>>>>>>>>     If the failure happens just after, then detection will take longer - giving the node longer to recover and not be fenced.
>>>>>>>>>
>>>>>>>>>     So fence/not fence is normal and to be expected.
>>>>>>>>>>      and restart resources with large delay MS:pgsql.
>>>>>>>>>>      One time after restart crmd - pgsql don't restart.
>>>>>>>>>     I would not expect pgsql to ever restart - if the RA does its job properly anyway.
>>>>>>>>>     In the case the node is not fenced, the crmd will respawn and the the PE will request that it re-detect the state of all resources.
>>>>>>>>>
>>>>>>>>>     If the agent reports "all good", then there is nothing more to do.
>>>>>>>>>     If the agent is not reporting "all good", you should really be asking why.
>>>>>>>>>>      4. pacemakerd - nothing happens.
>>>>>>>>>     On non-systemd based machines, correct.
>>>>>>>>>
>>>>>>>>>     On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons.
>>>>>>>>>     Any subsequent daemon failure will be detected and the daemon respawned.
>>>>>>>>     And! I almost forgot about IT!
>>>>>>>>     Exist another (NORMAL) the variants, the methods, the ideas?
>>>>>>>>     Without this  ... @$%#$%&$%^&$%^&##@#$$^$%& !!!!!
>>>>>>>>     Otherwise - it's a full epic fail ;)
>>>>>>>    -ENOPARSE
>>>>>>    OK, I remove my personal attitude to "systemd".
>>>>>>    Let me explain.
>>>>>>
>>>>>>    Somewhere in the beginning of this topic, I wrote:
>>>>>>    A.G.:Who knows who runs lrmd?
>>>>>>    A.B.:Pacemakerd.
>>>>>>    That's one!
>>>>>>
>>>>>>    Let's see the list of processes:
>>>>>>    #ps -axf
>>>>>>    .....
>>>>>>    6067 ?        Ssl    7:24 corosync
>>>>>>    6092 ?        S      0:25 pacemakerd
>>>>>>    6094 ?        Ss   116:13  \_ /usr/libexec/pacemaker/cib
>>>>>>    6095 ?        Ss     0:25  \_ /usr/libexec/pacemaker/stonithd
>>>>>>    6096 ?        Ss     1:27  \_ /usr/libexec/pacemaker/lrmd
>>>>>>    6097 ?        Ss     0:49  \_ /usr/libexec/pacemaker/attrd
>>>>>>    6098 ?        Ss     0:25  \_ /usr/libexec/pacemaker/pengine
>>>>>>    6099 ?        Ss     0:29  \_ /usr/libexec/pacemaker/crmd
>>>>>>    .....
>>>>>>    That's two!
>>>>>   Whats two?  I don't follow.
>>>>   In the sense that it creates other processes. But it does not matter.
>>>>>>    And more, more...
>>>>>>    Now you must understand - why I want this process to work always.
>>>>>>    Even I think, No need for anyone here to explain it!
>>>>>>
>>>>>>    And Now you say about "pacemakerd nice work, but only on systemd distros" !!!
>>>>>   No, I;m saying it works _better_ on systemd distros.
>>>>>   On non-systemd distros you still need quite a few unlikely-to-happen failures to trigger a situation in which the node still gets fenced and recovered (assuming no-one saw any of the error messages and didn't run "service pacemaker restart" prior to the additional failures).
>>>>   Can you show me the place where:
>>>>   "On a systemd based machine pacemakerd is respawned and reattaches to the existing daemons."?
>>>  The code for it is in mcp/pacemaker.c, look for find_and_track_existing_processes()
>>>
>>>  The ps tree will look different though
>>>
>>>   6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>>>   6095 ?        Ss     0:25  /usr/libexec/pacemaker/stonithd
>>>   6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>>>   6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>>>   6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>>>   6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
>>>  ...
>>>   6666 ?        S      0:25 pacemakerd
>>>
>>>  but pacemakerd will be watching the old children and respawning them on failure.
>>>  at which point you might see:
>>>
>>>   6094 ?        Ss   116:13  /usr/libexec/pacemaker/cib
>>>   6096 ?        Ss     1:27  /usr/libexec/pacemaker/lrmd
>>>   6097 ?        Ss     0:49  /usr/libexec/pacemaker/attrd
>>>   6098 ?        Ss     0:25  /usr/libexec/pacemaker/pengine
>>>   6099 ?        Ss     0:29  /usr/libexec/pacemaker/crmd
>>>  ...
>>>   6666 ?        S      0:25 pacemakerd
>>>   6667 ?        Ss     0:25 \_ /usr/libexec/pacemaker/stonithd
>>>>   If I respawn via upstart process pacemakerd - "reattaches to the existing daemons" ?
>>>  If upstart is capable of detecting the pacemakerd failure and automagically respawning it, then yes - the same process will happen.
>>  Some people defend you, send me hate mail when I'm not restrained.
>
> You should see the mail I get off-list ;-)
>
>>  But You're also a beetle :)
>
> I'm not 100% sure what you mean there.
>
>>  Why you did not say anything about supporting upstart in spec?
>
> Mostly because I don't run it anywhere, so I have no idea what it does by default or can be configured to do.
> Its not malicious, the feature was simply written and tested in the context of systemd.
>
> Also, when I think upstart, I think debian based distros which don't use spec files ;-)
>
> ,
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org