[Pacemaker] hangs pending
Andrey Groshev
greenx at yandex.ru
Fri Jan 10 10:55:39 UTC 2014
10.01.2014, 14:31, "Andrey Groshev" <greenx at yandex.ru>:
> 10.01.2014, 14:01, "Andrew Beekhof" <andrew at beekhof.net>:
>
>> On 10 Jan 2014, at 5:03 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>> 10.01.2014, 05:29, "Andrew Beekhof" <andrew at beekhof.net>:
>>>> On 9 Jan 2014, at 11:11 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>> 08.01.2014, 06:22, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>> On 29 Nov 2013, at 7:17 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>> Hi, ALL.
>>>>>>>
>>>>>>> I'm still trying to deal with the fact that after a fence, the node hangs in "pending".
>>>>>> Please define "pending". Where did you see this?
>>>>> In crm_mon:
>>>>> ......
>>>>> Node dev-cluster2-node2 (172793105): pending
>>>>> ......
>>>>>
>>>>> The experiment went like this:
>>>>> Four nodes in the cluster.
>>>>> On one of them, kill corosync or pacemakerd (signal 4, 6 or 11).
>>>>> Thereafter, the remaining nodes constantly reboot it under various pretexts: "softly whistling", "flying low", "not a cluster member!" ...
>>>>> Then "Too many failures ...." fell out in the log.
>>>>> All this time the status in crm_mon is "pending".
>>>>> Depending on the wind direction, it changed to "UNCLEAN".
>>>>> Much time has passed and I cannot describe the behavior accurately...
>>>>>
>>>>> Now I am in the following state:
>>>>> I tried to locate the problem and came up with this.
>>>>> I set a big value in the property stonith-timeout="600s".
>>>>> And got the following behavior:
>>>>> 1. pkill -4 corosync
>>>>> 2. The DC node calls my fence agent "sshbykey".
>>>>> 3. It reboots the victim and waits until it comes back to life.
>>>> Hmmm.... what version of pacemaker?
>>>> This sounds like a timing issue that we fixed a while back
>>> It was version 1.1.11 from December 3.
>>> Now I will do a full update and retest.
>> That should be recent enough. Can you create a crm_report the next time you reproduce?
>
> Of course, yes. A little delay.... :)
>
> ......
> cc1: warnings being treated as errors
> upstart.c: In function ‘upstart_job_property’:
> upstart.c:264: error: implicit declaration of function ‘g_variant_lookup_value’
> upstart.c:264: error: nested extern declaration of ‘g_variant_lookup_value’
> upstart.c:264: error: assignment makes pointer from integer without a cast
> gmake[2]: *** [libcrmservice_la-upstart.lo] Error 1
> gmake[2]: Leaving directory `/root/ha/pacemaker/lib/services'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/root/ha/pacemaker/lib'
> make: *** [core] Error 1
>
> I'm trying to solve this problem.
It is not getting solved quickly...
https://developer.gnome.org/glib/2.28/glib-GVariant.html#g-variant-lookup-value
g_variant_lookup_value() is available only since GLib 2.28, but:
# yum list installed glib2
Loaded plugins: fastestmirror, rhnplugin, security
This system is receiving updates from RHN Classic or Red Hat Satellite.
Loading mirror speeds from cached hostfile
Installed Packages
glib2.x86_64 2.26.1-3.el6 installed
# cat /etc/issue
CentOS release 6.5 (Final)
Kernel \r on an \m
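
If I cannot get a newer glib onto these nodes, I will probably have to guard that call at compile time. Below is only a rough sketch of what I mean, assuming upstart_job_property() is reading an a{sv} dictionary; lookup_value_compat() is just a name I made up, and only GLIB_CHECK_VERSION() and g_variant_lookup_value() come from GLib itself:

#include <glib.h>

/* Hypothetical compat wrapper: on GLib < 2.28, where g_variant_lookup_value()
 * does not exist yet, walk the a{sv} dictionary by hand instead. */
static GVariant *
lookup_value_compat(GVariant *dict, const gchar *key, const GVariantType *type)
{
#if GLIB_CHECK_VERSION(2, 28, 0)
    return g_variant_lookup_value(dict, key, type);
#else
    GVariantIter iter;
    GVariant *value = NULL;
    gchar *name = NULL;

    g_variant_iter_init(&iter, dict);
    while (g_variant_iter_next(&iter, "{sv}", &name, &value)) {
        gboolean match = (g_strcmp0(name, key) == 0);

        g_free(name);
        if (match) {
            /* Unbox the variant, as g_variant_lookup_value() does for a{sv}. */
            GVariant *inner = g_variant_get_variant(value);

            g_variant_unref(value);
            if (type == NULL || g_variant_is_of_type(inner, type)) {
                return inner;   /* the caller owns this reference */
            }
            g_variant_unref(inner);
            return NULL;
        }
        g_variant_unref(value);
    }
    return NULL;
#endif
}

That is only my reading of the GLib documentation, so treat it as a sketch; for the crm_report I may simply build on a machine with glib >= 2.28.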
>>>>> Once the script makes sure that the victim has rebooted and is reachable again via ssh, it exits with 0.
>>>>> All commands are logged on both the victim and the killer - everything looks right.
>>>>> 4. A little later, the status of the victim node in crm_mon changes to online.
>>>>> 5. BUT... not one resource starts! This despite the fact that "crm_simulate -sL" shows the correct resource to start:
>>>>> * Start pingCheck:3 (dev-cluster2-node2)
>>>>> 6. In this state, we spend the next 600 seconds.
>>>>> After this timeout expires, another node (not the DC) decides to kill our victim again.
>>>>> All commands are again logged on both the victim and the killer - all documented :)
>>>>> 7. NOW all resources start in the right sequence.
>>>>>
>>>>> I'm almost happy, but I don't like it: two reboots and 10 minutes of waiting ;)
>>>>> And if something happens on another node, this behavior is superimposed on the old one, and no resources start until the last node has been rebooted twice.
>>>>>
>>>>> I tried to understand this behavior.
>>>>> As I understand it:
>>>>> 1. Ultimately, ./lib/fencing/st_client.c calls internal_stonith_action_execute().
>>>>> 2. It forks and sets up a pipe from the child.
>>>>> 3. It asynchronously calls mainloop_child_add() with a callback to stonith_action_async_done().
>>>>> 4. It adds a timeout with g_timeout_add() that sends the TERM and KILL signals.
>>>>>
>>>>> If everything goes right, stonith_action_async_done() should be called and the timeout removed.
>>>>> For some reason this does not happen. I sit and think ....
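
To make it clearer what pattern I mean in points 1-4 above, here is a stripped-down sketch using plain GLib calls. The names fence_done() and fence_timeout() are mine, and the real code goes through pacemaker's mainloop_child_add() wrapper rather than g_child_watch_add() directly, so this is only an approximation:

#include <glib.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static GMainLoop *loop = NULL;
static guint timer_id = 0;

/* Runs when the main loop reaps the forked agent: the success path,
 * where the pending timeout must be removed. */
static void
fence_done(GPid pid, gint status, gpointer user_data)
{
    if (timer_id) {
        g_source_remove(timer_id);
        timer_id = 0;
    }
    printf("agent %d finished, status %d\n", (int) pid, status);
    g_main_loop_quit(loop);
}

/* Runs only if fence_done() was never called before the timeout expired. */
static gboolean
fence_timeout(gpointer user_data)
{
    pid_t pid = GPOINTER_TO_INT(user_data);

    fprintf(stderr, "agent %d timed out, sending SIGTERM\n", (int) pid);
    kill(pid, SIGTERM);
    timer_id = 0;
    return FALSE;
}

int
main(void)
{
    pid_t pid = fork();

    if (pid == 0) {                      /* child: stands in for the fence agent */
        execlp("sleep", "sleep", "2", NULL);
        _exit(1);
    }

    loop = g_main_loop_new(NULL, FALSE);
    g_child_watch_add(pid, fence_done, NULL);        /* ~ mainloop_child_add() */
    timer_id = g_timeout_add(10 * 1000, fence_timeout,
                             GINT_TO_POINTER(pid));  /* ~ stonith-timeout */
    g_main_loop_run(loop);
    return 0;
}

As far as I can tell, in my case the equivalent of fence_done() is never entered, so only the timeout path ever fires - hence the 600 seconds of waiting and the second reboot.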
>>>>>>> At this time, there are constant re-elections.
>>>>>>> Also, I noticed a difference in how pacemaker starts up.
>>>>>>> At normal startup:
>>>>>>> * corosync
>>>>>>> * pacemakerd
>>>>>>> * attrd
>>>>>>> * pengine
>>>>>>> * lrmd
>>>>>>> * crmd
>>>>>>> * cib
>>>>>>>
>>>>>>> When hangs start:
>>>>>>> * corosync
>>>>>>> * pacemakerd
>>>>>>> * attrd
>>>>>>> * pengine
>>>>>>> * crmd
>>>>>>> * lrmd
>>>>>>> * cib
>>>>>> Are you referring to the order of the daemons here?
>>>>>> The cib should not be at the bottom in either case.
>>>>>>> Who knows who runs lrmd?
>>>>>> Pacemakerd.
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org