[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Fri Jan 10 06:03:17 UTC 2014
10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>> Hi, ALL.
>>>>
>>>> I'm still trying to cope with the fact that, after fencing, the node hangs in "pending".
>>> Please define "pending". Where did you see this?
>> In crm_mon:
>> ......
>> Node dev-cluster2-node2 (172793105): pending
>> ......
>>
>> The experiment was like this:
>> Four nodes in the cluster.
>> On one of them, kill corosync or pacemakerd (signal 4, 6, or 11).
>> Thereafter, the remaining nodes constantly reboot it, under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>> Then "Too many failures ...." fell out in the log.
>> All this time the status in crm_mon is "pending".
>> Depending on the wind direction, it changed to "UNCLEAN".
>> Much time has passed and I can no longer describe the behavior accurately...
>>
>> Now I am in the following state:
>> I tried to locate the problem and came here with this:
>> I set a big value in the property stonith-timeout="600s"
>> and got the following behavior:
>> 1. pkill -4 corosync
>> 2. The node with the DC calls my fence agent "sshbykey".
>> 3. It sends a reboot to the victim and waits until it comes back to life (see the sketch after this list).
> Hmmm.... what version of pacemaker?
> This sounds like a timing issue that we fixed a while back
It was version 1.1.11 from December 3.
I will now do a full update and retest.
>> Once the script has made sure that the victim rebooted and is again reachable via ssh, it exits with 0.
>> All commands are logged on both the victim and the killer - all right.
>> 4. A little later, the status of the victim node in crm_mon changes to online.
>> 5. BUT... not one resource starts! Even though "crm_simulate -sL" shows the correct resource to start:
>> * Start pingCheck:3 (dev-cluster2-node2)
>> 6. In this state we spend the next 600 seconds.
>> When this timeout expires, another node (not the DC) decides to kill our victim again.
>> All commands are again logged on both the victim and the killer - all documented :)
>> 7. NOW all resources start in the right sequence.
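
A minimal sketch, in C, of what the wait loop from step 3 could look like. The real "sshbykey" agent is not shown in this thread, so the function name, address, and timings below are hypothetical; the idea is just to poll TCP port 22 until it answers or a deadline passes, then report success (0) or failure (1):

    /* Hypothetical stand-in for the "wait until the victim answers on
     * ssh again" step of a reboot-style fence agent. */
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    static int
    wait_for_ssh(const char *ip, int deadline_s)
    {
        time_t give_up = time(NULL) + deadline_s;

        while (time(NULL) < give_up) {
            struct sockaddr_in addr;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            if (fd < 0) {
                return 1;                 /* no sockets: give up */
            }
            memset(&addr, 0, sizeof(addr));
            addr.sin_family = AF_INET;
            addr.sin_port = htons(22);    /* sshd */
            inet_pton(AF_INET, ip, &addr.sin_addr);

            if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == 0) {
                close(fd);
                return 0;                 /* victim is back: fencing confirmed */
            }
            close(fd);
            sleep(5);                     /* a reboot takes a while; poll slowly */
        }
        return 1;                         /* never came back within the deadline */
    }

    int
    main(int argc, char **argv)
    {
        /* 192.0.2.10 is a placeholder address, 300 s a placeholder deadline. */
        return wait_for_ssh(argc > 1 ? argv[1] : "192.0.2.10", 300);
    }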
>>
>> I'm almost happy, but I don't like it: two reboots and 10 minutes of waiting ;)
>> And if something happens on another node, this behavior is superimposed on the old one, and no resources start at all until the last node has rebooted twice.
>>
>> I tried to understand this behavior.
>> As I understand it:
>> 1. Ultimately, internal_stonith_action_execute() in ./lib/fencing/st_client.c is called.
>> 2. It forks and creates pipes to the child.
>> 3. It asynchronously registers the child via mainloop_child_add() with stonith_action_async_done as the callback.
>> 4. It adds a timeout with g_timeout_add() that will send the TERM and KILL signals.
>>
>> If everything goes right, stonith_action_async_done is called and the timeout is removed.
>> For some reason this does not happen. I sit and think ....
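
A minimal sketch of the pattern from steps 1-4, assuming plain GLib (this is not the actual st_client.c code: the pipe handling is omitted, and GLib's g_child_watch_add() stands in for Pacemaker's mainloop_child_add() wrapper). It shows why a lost child-exit callback matters: the armed timeout is then the only thing left to fire, which matches the second fencing seen after stonith-timeout expires:

    /* Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0) */
    #include <glib.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    static GMainLoop *loop = NULL;
    static guint timer_id = 0;

    /* Analogue of stonith_action_async_done(): runs when the main loop
     * reaps the child.  On the happy path it must cancel the timeout. */
    static void
    child_done(GPid pid, gint status, gpointer unused)
    {
        if (timer_id) {
            g_source_remove(timer_id);   /* the removal that seems not to happen */
            timer_id = 0;
        }
        fprintf(stderr, "agent %d finished, status %d\n", (int) pid, status);
        g_spawn_close_pid(pid);
        g_main_loop_quit(loop);
    }

    /* Runs only if the child outlives its allowance; the real code
     * escalates from SIGTERM to SIGKILL. */
    static gboolean
    child_timeout(gpointer data)
    {
        kill(GPOINTER_TO_INT(data), SIGTERM);
        timer_id = 0;
        return G_SOURCE_REMOVE;
    }

    int
    main(void)
    {
        loop = g_main_loop_new(NULL, FALSE);
        pid_t pid = fork();                 /* step 2: start the agent */

        if (pid == 0) {                     /* child stands in for the agent */
            execlp("sleep", "sleep", "2", (char *) NULL);
            _exit(127);
        }
        g_child_watch_add(pid, child_done, NULL);       /* step 3 */
        timer_id = g_timeout_add(5000, child_timeout,   /* step 4 */
                                 GINT_TO_POINTER(pid));
        g_main_loop_run(loop);
        return 0;
    }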
>>>> At this time, there are constant re-elections.
>>>> Also, I noticed a difference in how pacemaker starts up.
>>>> At normal startup:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * lrmd
>>>> * crmd
>>>> * cib
>>>>
>>>> At a startup that hangs:
>>>> * corosync
>>>> * pacemakerd
>>>> * attrd
>>>> * pengine
>>>> * crmd
>>>> * lrmd
>>>> * cib
>>> Are you referring to the order of the daemons here?
>>> The cib should not be at the bottom in either case.
>>>> Who knows who runs lrmd?
>>> Pacemakerd.
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org