[Pacemaker] hangs pending

Wed Mar 5 05:42:07 UTC 2014

05.03.2014, 04:04, "Andrew Beekhof" <andrew at beekhof.net>:
> On 25 Feb 2014, at 8:30 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>>  21.02.2014, 12:04, "Andrey Groshev" <greenx at yandex.ru>:
>>>  21.02.2014, 05:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>   On 19 Feb 2014, at 7:53 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>    19.02.2014, 09:49, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>    On 19 Feb 2014, at 4:18 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>     19.02.2014, 09:08, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>     On 19 Feb 2014, at 4:00 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>      19.02.2014, 06:48, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>      On 18 Feb 2014, at 11:05 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>       Hi, ALL and Andrew!
>>>>>>>>>>>
>>>>>>>>>>>       Today was a good day: I killed a lot, and got shot at a lot.
>>>>>>>>>>>       Overall I am happy (almost as happy as an elephant) :)
>>>>>>>>>>>       Apart from the resources, eight processes on the node matter to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>>>>       I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>       The behavior does not depend on the signal number, which is good.
>>>>>>>>>>>       If STONITH sends a reboot to the node, it reboots and rejoins the cluster, which is also good.
>>>>>>>>>>>       But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>
>>>>>>>>>>>       They fall into four groups:
>>>>>>>>>>>       1. corosync, cib - STONITH works 100% of the time.
>>>>>>>>>>>       Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>
>>>>>>>>>>>       2. lrmd, crmd - STONITH behaves strangely.
>>>>>>>>>>>       Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>       Sometimes the daemon restarts and the resources restart, with a long delay for MS:pgsql.
>>>>>>>>>>>       Once, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>
>>>>>>>>>>>       3. stonithd, attrd, pengine - STONITH is not needed.
>>>>>>>>>>>       These daemons simply restart; the resources stay running.
>>>>>>>>>>>
>>>>>>>>>>>       4. pacemakerd - nothing happens.
>>>>>>>>>>>       After that I can kill any process from the third group, and they do not restart.
>>>>>>>>>>>       Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>>>>>>
>>>>>>>>>>>       What do you think about this?
>>>>>>>>>>>       The main question of this thread is resolved.
>>>>>>>>>>>       But this varied behavior is another big problem.
>>>>>>>>>>>
>>>>>>>>>>>       I forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>      Which of the various conditions above do the logs cover?
>>>>>>>>>      All of them, over the whole day.
>>>>>>>>     Are you trying to torture me?
>>>>>>>>     Can you give me a rough idea what happened when?
>>>>>>>     No, it is 8 processes times 4 signals, plus repeated experiments with unknown outcomes :)
>>>>>>>     It is easier to run new experiments and collect separate new logs.
>>>>>>>     Which variant is more interesting?
>>>>>>    The long delay in restarting pgsql.
>>>>>>    Everything else seems correct.
>>>>>    It did not even try to start pgsql.
>>>>>    There are three tests in the logs.
>>>>>    kill -s 4 <lrmd pid>.
>>>>>    1. STONITH
>>>>>    2. STONITH
>>>>>    3. hangs
>>>>   It's waiting on a value for default_ping_set
>>>>
>>>>   It seems we're calling monitor for pingCheck but for some reason it's not performing an update:
>>>>
>>>>   # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:     info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>   Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>   Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>   Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru       lrmd:    debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>
>>>>   Could you add:
>>>>
>>>>     export OCF_TRACE_RA=1
>>>>
>>>>   to the top of the ping agent and retest?
>>>  Today it worked on the fourth attempt.
>>>  I even wondered whether it makes a difference how the process is killed (kill -s 4 pid or pkill -4 lrmd).
>>>  Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>  Hi,
>>  Have you not looked at it yet?
>
> Not yet. I've been hitting ACLs with a large hammer.
> Where are we up to with this?  Do I disregard this one and look at the most recent email?
>

Hi.
No. These are two different cases.
* When resources don't start after lrmd is killed.
That is this one: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2

* When the entire cluster is put in standby (all nodes standby).
The second node hangs in pending.
But with the latest rebuilt rpm the problem was not reproduced.
So for now this can be considered not a problem.
http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
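
For anyone going through these archives, the filter used earlier in the thread (one daemon PID, one daemon name, one resource) can be wrapped in a small shell function. This is only a sketch: the name daemon_rsc_lines and the log path in the example call are assumptions, not part of Pacemaker.

```shell
#!/bin/sh
# Sketch only: daemon_rsc_lines is a hypothetical helper name, not a Pacemaker tool.
# Prints lines from <logfile> written by <daemon> with PID <pid> that mention <rsc>.
daemon_rsc_lines() {
    logfile=$1
    pid=$2
    daemon=$3
    rsc=$4
    # "[pid]" is matched literally; in the log the daemon name is followed by ":".
    grep -E "\[${pid}\][[:space:]].*[[:space:]]${daemon}:.*${rsc}" "$logfile"
}

# Example call (the log path is an assumption; point it at your own corosync.log):
# daemon_rsc_lines /var/log/cluster/corosync.log 2632 lrmd pingCheck
```

Anchoring on the bracketed PID is slightly stricter than a bare "2632.*lrmd.*pingCheck" pattern, which could also match the same digits elsewhere on a line.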