[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Fri Jan 10 06:03:17 UTC 2014
10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>> Hi, ALL.
>>>>
>>>> I'm still trying to cope with the fact that, after fencing, the node hangs in "pending".
>>> Please define "pending". Where did you see this?
>> In crm_mon:
>> ......
>> Node dev-cluster2-node2 (172793105): pending
>> ......
>>
>> The experiment was like this:
>> Four nodes in the cluster.
>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>> Thereafter, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>> Then "Too many failures ...." fell out in the log.
>> All this time the status in crm_mon is "pending".
>> Depending on the wind direction, it changed to "UNCLEAN".
>> Much time has passed and I can no longer describe the behavior accurately...
>>
>> Now I am in the following state:
>> I tried to locate the problem and came here with this:
>> I set a big value in the property stonith-timeout="600s"
>> and got the following behavior:
>> 1. pkill -4 corosync
>> 2. The node with the DC calls my fence agent "sshbykey".
>> 3. It sends a reboot to the victim and waits until it comes back to life (see the sketch after this list).
> Hmmm.... what version of pacemaker?
> This sounds like a timing issue that we fixed a while back
It was version 1.1.11 from December 3.
I will now do a full update and retest.
>> Once the script has made sure that the victim rebooted and is again reachable via ssh, it exits with 0.
>> All commands are logged on both the victim and the killer - all right.
>> 4. A little later, the status of the victim node in crm_mon changes to online.
>> 5. BUT... not one resource starts! Even though "crm_simulate -sL" shows the correct resource to start:
>> * Start pingCheck:3 (dev-cluster2-node2)
>> 6. In this state we spend the next 600 seconds.
>> When this timeout expires, another node (not the DC) decides to kill our victim again.
>> All commands are again logged on both the victim and the killer - all documented :)
>> 7. NOW all resources start in the right sequence.
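
A minimal sketch, in C, of what the wait loop from step 3 could look like. The real "sshbykey" agent is not shown in this thread, so the function name, address, and timings below are hypothetical; the idea is just to poll TCP port 22 until it answers or a deadline passes, then report success (0) or failure (1):

    /* Hypothetical stand-in for the "wait until the victim answers on
     * ssh again" step of a reboot-style fence agent. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int
    wait_for_ssh(const char *ip, int deadline_s)
    {
        time_t give_up = time(NULL) + deadline_s;

        while (time(NULL) < give_up) {
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            if (fd < 0) {
                return 1;                 /* no sockets: give up */
            }
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(22);    /* sshd */
            inet_pton(AF_INET, ip, &addr.sin_addr);

            if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
                close(fd);
                return 0;                 /* victim is back: fencing confirmed */
            }
            close(fd);
            sleep(5);                     /* a reboot takes a while; poll slowly */
        }
        return 1;                         /* never came back within the deadline */
    }

    int
    main(int argc, char **argv)
    {
        /* 192.0.2.10 is a placeholder address, 300 s a placeholder deadline. */
        return wait_for_ssh(argc > 1 ? argv[1] : "192.0.2.10", 300);
    }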
>>
>> I'm almost happy, but I don't like it: two reboots and 10 minutes of waiting ;)
>> And if something happens on another node, this behavior is superimposed on the old one, and no resources start at all until the last node has rebooted twice.
>>
>> I tried to understand this behavior.
>> As I understand it:
>> 1. Ultimately, internal_stonith_action_execute() in ./lib/fencing/st_client.c is called.
>> 2. It forks and creates pipes to the child.
>> 3. It asynchronously registers the child via mainloop_child_add() with stonith_action_async_done as the callback.
>> 4. It adds a timeout with g_timeout_add() that will send the TERM and KILL signals.
>>
>> If everything goes right, stonith_action_async_done is called and the timeout is removed.
>> For some reason this does not happen. I sit and think ....
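
A minimal sketch of the pattern from steps 1-4, assuming plain GLib (this is not the actual st_client.c code: the pipe handling is omitted, and GLib's g_child_watch_add() stands in for Pacemaker's mainloop_child_add() wrapper). It shows why a lost child-exit callback matters: the armed timeout is then the only thing left to fire, which matches the second fencing seen after stonith-timeout expires:

    /* Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0) */
    #include <glib.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static GMainLoop *loop = NULL;
    static guint timer_id = 0;

    /* Analogue of stonith_action_async_done(): runs when the main loop
     * reaps the child.  On the happy path it must cancel the timeout. */
    static void
    child_done(GPid pid, gint status, gpointer unused)
    {
        if (timer_id) {
            g_source_remove(timer_id);   /* the removal that seems not to happen */
            timer_id = 0;
        }
        fprintf(stderr, "agent %d finished, status %d\n", (int) pid, status);
        g_spawn_close_pid(pid);
        g_main_loop_quit(loop);
    }

    /* Runs only if the child outlives its allowance; the real code
     * escalates from SIGTERM to SIGKILL. */
    static gboolean
    child_timeout(gpointer data)
    {
        kill(GPOINTER_TO_INT(data), SIGTERM);
        timer_id = 0;
        return G_SOURCE_REMOVE;
    }

    int
    main(void)
    {
        loop = g_main_loop_new(NULL, FALSE);
        pid_t pid = fork();                 /* step 2: start the agent */

        if (pid == 0) {                     /* child stands in for the agent */
            execlp("sleep", "sleep", "2", (char *) NULL);
            _exit(127);
        }
        g_child_watch_add(pid, child_done, NULL);       /* step 3 */
        timer_id = g_timeout_add(5000, child_timeout,   /* step 4 */
                                 GINT_TO_POINTER(pid));
        g_main_loop_run(loop);
        return 0;
    }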
>>>> At this time, there are constant re-elections.
>>>> Also, I noticed a difference in how pacemaker starts up.
>>>> At normal startup:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * lrmd
>>>> * crmd
>>>> * cib
>>>>
>>>> At a startup that hangs:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * crmd
>>>> * lrmd
>>>> * cib
>>> Are you referring to the order of the daemons here?
>>> The cib should not be at the bottom in either case.
>>>> Who knows who runs lrmd?
>>> Pacemakerd.
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org