[Pacemaker] hangs pending

Tue Mar 18 19:19:13 UTC 2014

12.03.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
> Sorry for the delay, sometimes it takes a while to rebuild the necessary context

I'm sorry for the delayed answer too.
I switched to using "upstart" to start corosync and pacemaker (with respawn).
The behavior of the system has changed, and it suits me (so far :) ).
Now I have to kill crmd/lrmd in an infinite loop before STONITH shoots the node.
Otherwise the daemon respawns very quickly and nothing else happens.
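
A minimal sketch of the kill loop described above, assuming pkill (procps) is available on the node. The function name kill_repeatedly and its four arguments are hypothetical and not part of pacemaker; it only sends the signal repeatedly and does not check whether STONITH actually fired.

```shell
#!/bin/sh
# Hypothetical helper: send SIGNAL to every process named exactly NAME,
# COUNT times, sleeping DELAY seconds between rounds.
# Usage: kill_repeatedly NAME SIGNAL COUNT DELAY
kill_repeatedly() {
    kr_name=$1
    kr_signal=$2
    kr_count=$3
    kr_delay=$4
    kr_i=0
    while [ "$kr_i" -lt "$kr_count" ]; do
        # pkill returns non-zero when nothing matched (e.g. the daemon
        # has not been respawned yet); ignore that and keep looping.
        pkill "-$kr_signal" -x "$kr_name" || true
        kr_i=$((kr_i + 1))
        sleep "$kr_delay"
    done
}
```

For example, "kill_repeatedly crmd 4 20 1" run as root on a test node would send signal 4 to crmd once per second for 20 rounds.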

Of course, I still found another way to hang the system.
It requires only one idiot.
1. He decides to update pacemaker (and/or remove a service he does not understand).
2. Then he kills the corosync process or simply reboots the server.
That's all! The node will remain hung in "pending".
And the worst thing: if at least one node hangs in "pending", promote/demote and other resource management operations do not work.
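
A rough sketch for spotting such nodes from a script. It assumes that the one-shot status output of "crm_mon -1" lists a pending node on a line of the form "Node NAME: pending" or "Node NAME (ID): pending"; that format is an assumption and may differ between pacemaker versions. The function name pending_nodes is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read cluster status text on stdin and print the
# name of every node reported as "pending", one per line.
# Assumed input lines: "Node NAME: pending" or "Node NAME (ID): pending".
pending_nodes() {
    awk '$1 == "Node" && $NF == "pending" {
        name = $2
        sub(/:$/, "", name)   # strip the trailing colon, if any
        print name
    }'
}
```

It would be used as "crm_mon -1 | pending_nodes"; empty output means no node matched the assumed pattern.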

Yes, there is still one oddity: some resources take a long time to start, but IMHO it is not very critical.
I will try to fix it later; right now I am writing documentation for the project.

>
> On 5 Mar 2014, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  05.03.2014, 04:04, "Andrew Beekhof" <andrew at beekhof.net>:
>>>  On 25 Feb 2014, at 8:30 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>   21.02.2014, 12:04, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>   21.02.2014, 05:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>    On 19 Feb 2014, at 7:53 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>     19.02.2014, 09:49, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>     On 19 Feb 2014, at 4:18 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>      19.02.2014, 09:08, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>      On 19 Feb 2014, at 4:00 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>       19.02.2014, 06:48, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>       On 18 Feb 2014, at 11:05 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>        Hi, ALL and Andrew!
>>>>>>>>>>>>>
>>>>>>>>>>>>>        Today is a good day - I killed a lot, and got shot at a lot.
>>>>>>>>>>>>>        In general - I am happy (almost like an elephant)   :)
>>>>>>>>>>>>>        Besides the resources, eight processes on the node are important to me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>>>        I killed them with different signals (4,6,11 and even 9).
>>>>>>>>>>>>>        The behavior does not depend on the signal number - that's good.
>>>>>>>>>>>>>        If STONITH sends a reboot to the node, it reboots and rejoins the cluster - that's good too.
>>>>>>>>>>>>>        But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        Four groups emerged:
>>>>>>>>>>>>>        1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>>>        Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        2. lrmd,crmd - strange STONITH behavior.
>>>>>>>>>>>>>        Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>>>        Sometimes the daemon restarts and the resources restart, with a large delay for MS:pgsql.
>>>>>>>>>>>>>        One time, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>>>>        These daemons simply restart; the resources stay running.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        4. pacemakerd - nothing happens.
>>>>>>>>>>>>>        And after that I can kill any process of the third group. They do not restart.
>>>>>>>>>>>>>        Generally, don't touch corosync,cib and maybe lrmd,crmd.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        What do you think about this?
>>>>>>>>>>>>>        The main question of this topic is settled.
>>>>>>>>>>>>>        But this varied behavior is another big problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>        I forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>>>       Which of the various conditions above do the logs cover?
>>>>>>>>>>>       All of them, over the course of the day.
>>>>>>>>>>      Are you trying to torture me?
>>>>>>>>>>      Can you give me a rough idea what happened when?
>>>>>>>>>      No, there are 8 processes times 4 signals, plus repeated experiments with unknown outcomes :)
>>>>>>>>>      It is easier to run new experiments and collect separate new logs.
>>>>>>>>>      Which variant is more interesting?
>>>>>>>>     The long delay in restarting pgsql.
>>>>>>>>     Everything else seems correct.
>>>>>>>     It did not even try to start pgsql.
>>>>>>>     The logs contain three tests.
>>>>>>>     kill -s4 lrmd pid.
>>>>>>>     1. STONITH
>>>>>>>     2. STONITH
>>>>>>>     3. hangs
>>>>>>    It's waiting on a value for default_ping_set
>>>>>>
>>>>>>    It seems we're calling monitor for pingCheck but for some reason it's not performing an update:
>>>>>>
>>>>>>    # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>>>    Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>>>    Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>>>    Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>>>    Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>>>    Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>>>    Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>>>    Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>>>    Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>>>    Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>>>    Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>>>    Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>>>    Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>>>
>>>>>>    Could you add:
>>>>>>
>>>>>>      export OCF_TRACE_RA=1
>>>>>>
>>>>>>    to the top of the ping agent and retest?
>>>>>   Today it worked on the fourth attempt.
>>>>>   I even wondered whether it makes a difference how the process is killed (kill -s 4 pid or pkill -4 lrmd)
>>>>>   Logs http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>>>   Hi,
>>>>   Have you had a chance to look at it?
>>>  Not yet. I've been hitting ACLs with a large hammer.
>>>  Where are we up to with this?  Do I disregard this one and look at the most recent email?
>>  Hi.
>>  No. These are two different cases.
>>  *   When resources don't start after lrmd is killed.
>>  Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>
> Grumble, the logs are still useless...
>
> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456 - exited with rc=0
> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456:stderr [ -- empty -- ]
> ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456:stdout [ -- empty -- ]
>
> Can you just add the following to the top of the resource agent?
>
> set -x
>
>>  * When the entire cluster is put in standby (all nodes standby).
>>  The second node hangs in pending.
>>  But with the latest rebuilt rpm the problem was not reproduced.
>
> Ok, so potentially this is fixed with the latest git?
>
>>  Therefore, for now this can be considered not a problem.
>>  http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
>>
>>  _______________________________________________
>>  Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>  http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>>  Project Home: http://www.clusterlabs.org
>>  Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>  Bugs: http://bugs.clusterlabs.org