[Pacemaker] hangs pending

Wed Mar 19 05:00:56 UTC 2014

19.03.2014, 03:29, "Andrew Beekhof" <andrew at beekhof.net>:
> On 19 Mar 2014, at 6:19 am, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  12.03.2014, 02:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>  Sorry for the delay, sometimes it takes a while to rebuild the necessary context
>>  I'm sorry too for the delayed answer.
>>  I switched to using "upstart" to start corosync and pacemaker (with respawn).
>>  Now the behavior of the system has changed and it suits me (so far :) ).
>>  I have to kill crmd/lrmd in an infinite loop before STONITH shoots the node.
>>  Otherwise they respawn very quickly and nothing else happens.
>>
>>  Of course, I still found another way to hang the system.
>>  It requires only one idiot.
>>  1. He decides to update pacemaker (and/or remove a service he does not understand).
>>  2. Then he kills the corosync process or simply reboots the server.
>>  That's all! This node will remain hung in "pending".
>
> While trying to shutdown?
> Our spec files shut pacemaker down prior to upgrades FWIW.

It's not so simple... we have a national tradition: to care for and cherish idiots.
Therefore, they are clever, quirky and unpredictable. ;)
He can simply delete the package's files without uninstalling it.
(In reality, it could just be a file system crash.)

>
>>  And the worst thing... if at least one node hangs in "pending", promote/demote and other resource management operations do not work.
>>
>>  Yes, there is one oddity with some resources taking a long time to start, but IMHO it is not very critical.
>>  I will try to fix it later; right now I am writing documentation for the project.
>>>  On 5 Mar 2014, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>   05.03.2014, 04:04, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>   On 25 Feb 2014, at 8:30 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>    21.02.2014, 12:04, "Andrey Groshev" <greenx at yandex.ru>:
>>>>>>>    21.02.2014, 05:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>     On 19 Feb 2014, at 7:53 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>      19.02.2014, 09:49, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>      On 19 Feb 2014, at 4:18 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>       19.02.2014, 09:08, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>       On 19 Feb 2014, at 4:00 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>        19.02.2014, 06:48, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>>>>>        On 18 Feb 2014, at 11:05 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>>>>>         Hi, ALL and Andrew!
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         Today was a good day: I killed a lot, and I got shot at a lot.
>>>>>>>>>>>>>>>         In general, I am happy (almost as happy as an elephant)   :)
>>>>>>>>>>>>>>>         Besides the resources, eight processes on the node are important to me: corosync,pacemakerd,cib,stonithd,lrmd,attrd,pengine,crmd.
>>>>>>>>>>>>>>>         I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>>>>>         The behavior does not depend on the signal number, which is good.
>>>>>>>>>>>>>>>         If STONITH sends a reboot to the node, it reboots and rejoins the cluster, which is also good.
>>>>>>>>>>>>>>>         But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         Four groups emerged:
>>>>>>>>>>>>>>>         1. corosync,cib - STONITH works 100%.
>>>>>>>>>>>>>>>         Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         2. lrmd,crmd - STONITH behaves strangely.
>>>>>>>>>>>>>>>         Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>>>>>         Sometimes the daemon restarts and the resources restart, with a large delay for MS:pgsql.
>>>>>>>>>>>>>>>         Once, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         3. stonithd,attrd,pengine - STONITH not needed.
>>>>>>>>>>>>>>>         These daemons simply restart; resources stay running.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         4. pacemakerd - nothing happens.
>>>>>>>>>>>>>>>         And after that I can kill any process of the third group. They do not restart.
>>>>>>>>>>>>>>>         Generally, don't touch corosync,cib and maybe lrmd,crmd.
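The experiment described above (eight processes, four signals) can be written down as a test matrix. A minimal sketch, assuming pkill is used to deliver the signals; the function name kill_matrix is hypothetical, and it only prints the commands instead of running them:

```shell
#!/bin/sh
# Hypothetical helper: print (do not execute) one pkill command per
# daemon/signal pair from the experiment described above.
kill_matrix() {
    for proc in corosync pacemakerd cib stonithd lrmd attrd pengine crmd; do
        for sig in 4 6 11 9; do
            # Dry run only: drop the "echo" on a disposable test node
            # to actually send the signal.
            echo "pkill -$sig $proc"
        done
    done
}

kill_matrix
```

Running one pair at a time and saving the logs after each run would make it possible to tell which outcome belongs to which kill.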
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         What do you think about this?
>>>>>>>>>>>>>>>         The main question of this thread has been resolved.
>>>>>>>>>>>>>>>         But this varied behavior is another big problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>         I forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>>>>>        Which of the various conditions above do the logs cover?
>>>>>>>>>>>>>        All of them, over the course of the day.
>>>>>>>>>>>>       Are you trying to torture me?
>>>>>>>>>>>>       Can you give me a rough idea what happened when?
>>>>>>>>>>>       No, it is 8 processes times 4 signals, plus repeated experiments with unknown outcomes :)
>>>>>>>>>>>       It is easier to run new experiments and collect separate new logs.
>>>>>>>>>>>       Which variant is the most interesting?
>>>>>>>>>>      The long delay in restarting pgsql.
>>>>>>>>>>      Everything else seems correct.
>>>>>>>>>      It did not even try to start pgsql.
>>>>>>>>>      There are three tests in the logs.
>>>>>>>>>      kill -s 4 <lrmd pid>
>>>>>>>>>      1. STONITH
>>>>>>>>>      2. STONITH
>>>>>>>>>      3. hangs
>>>>>>>>     It's waiting on a value for default_ping_set
>>>>>>>>
>>>>>>>>     It seems we're calling monitor for pingCheck but for some reason it's not performing an update:
>>>>>>>>
>>>>>>>>     # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>>>>>     Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>>>>>     Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>>>>>     Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>>>>>     Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>>>>>     Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>>>>>     Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>>>>>     Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>>>>>     Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>>>>>     Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>>>>>     Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>>>>>     Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>>>>>     Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>>>>>
>>>>>>>>     Could you add:
>>>>>>>>
>>>>>>>>       export OCF_TRACE_RA=1
>>>>>>>>
>>>>>>>>     to the top of the ping agent and retest?
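To see at a glance which operations lrmd reported as finished, the "exited with rc=" lines of the log excerpt above can be condensed. A minimal sketch, assuming the log format shown in that excerpt; the function name summarize_ops is hypothetical, and the resource name is used as-is inside the pattern (so it should not contain regex metacharacters):

```shell
#!/bin/sh
# Hypothetical helper: read a corosync.log on stdin and print
# "<operation> rc=<code>" for every finished operation of one resource.
summarize_ops() {
    rsc="$1"
    # Only lines of the form
    #   ... operation_finished: <rsc>_<action>_<interval>:<pid> - exited with rc=<n>
    # match; the ":stderr" and ":stdout" lines are skipped.
    sed -n "s/.*operation_finished: \(${rsc}_[^:]*\):[0-9]* - exited with rc=\([0-9]*\).*/\1 rc=\2/p"
}

# Example (log path is a placeholder):
#   summarize_ops pingCheck < corosync.log
```

This only shows that the monitor exited with rc=0; it does not show whether the agent updated the attribute, which is what the trace is needed for.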
>>>>>>>    Today it worked on the fourth attempt.
>>>>>>>    I even wondered whether it makes a difference how lrmd is killed (kill -s 4 pid or pkill -4 lrmd).
>>>>>>>    Logs http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>>>>>    Hi,
>>>>>>    Have you not looked at it yet?
>>>>>   Not yet. I've been hitting ACLs with a large hammer.
>>>>>   Where are we up to with this?  Do I disregard this one and look at the most recent email?
>>>>   Hi.
>>>>   No. These are two different cases.
>>>>   *   When resources don't start after lrmd is killed.
>>>>   That is this one: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>>  Grumble, the logs are still useless...
>>>
>>>  ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456 - exited with rc=0
>>>  ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456:stderr [ -- empty -- ]
>>>  ./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_start_0:26456:stdout [ -- empty -- ]
>>>
>>>  Can you just add the following to the top of the resource agent?
>>>
>>>  set -x
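For reference, "set -x" makes the shell print each command to stderr before running it, so the agent's stderr is no longer empty in the log. A minimal sketch; fake_monitor is a hypothetical stand-in and not the real ping agent, and the address is an illustrative value only:

```shell
#!/bin/sh
# Hypothetical stand-in for a resource agent action (NOT the real ping agent).
fake_monitor() {
    set -x                  # start tracing: each command is echoed to stderr
    target="127.0.0.1"      # illustrative value only
    test -n "$target"
    set +x                  # stop tracing
}

# Capture the trace (written to stderr) so it can be inspected.
trace=$(fake_monitor 2>&1)
printf '%s\n' "$trace"
```

Each traced command appears on its own line prefixed with one or more "+" characters, which shows exactly which branch of the agent was taken.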
>>>>   * When the entire cluster is put in standby (all nodes standby).
>>>>   The second node hangs in pending.
>>>>   But with the latest rpm rebuild, the problem was not reproduced.
>>>  Ok, so potentially this is fixed with the latest git?
>>>>   Therefore, for now this can be considered not a problem.
>>>>   http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
>>>>
>>>>   _______________________________________________
>>>>   Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>   http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>>   Project Home: http://www.clusterlabs.org
>>>>   Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>   Bugs: http://bugs.clusterlabs.org