[Pacemaker] hangs pending
Andrew Beekhof
andrew at beekhof.net
Tue Mar 11 22:47:32 UTC 2014
Sorry for the delay; sometimes it takes a while to rebuild the necessary context.
On 5 Mar 2014, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>
>
> 05.03.2014, 04:04, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 25 Feb 2014, at 8:30 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>
>>> 21.02.2014, 12:04, "Andrey Groshev" <greenx at yandex.ru>:
>>>> 21.02.2014, 05:53, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>> On 19 Feb 2014, at 7:53 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>> 19.02.2014, 09:49, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>> On 19 Feb 2014, at 4:18 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>> 19.02.2014, 09:08, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>> On 19 Feb 2014, at 4:00 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>> 19.02.2014, 06:48, "Andrew Beekhof" <andrew at beekhof.net>:
>>>>>>>>>>> On 18 Feb 2014, at 11:05 pm, Andrey Groshev <greenx at yandex.ru> wrote:
>>>>>>>>>>>> Hi, ALL and Andrew!
>>>>>>>>>>>>
>>>>>>>>>>>> Today was a good day: I killed a lot, and got shot at a lot.
>>>>>>>>>>>> In general, I am happy (almost like an elephant) :)
>>>>>>>>>>>> Apart from the resources, eight processes on the node matter to me: corosync, pacemakerd, cib, stonithd, lrmd, attrd, pengine, crmd.
>>>>>>>>>>>> I killed them with different signals (4, 6, 11 and even 9).
>>>>>>>>>>>> The behavior does not depend on the signal number, which is good.
>>>>>>>>>>>> If STONITH sends a reboot to the node, it reboots and rejoins the cluster, which is also good.
>>>>>>>>>>>> But the behavior differs depending on which daemon is killed.
>>>>>>>>>>>>
>>>>>>>>>>>> They fall into four groups:
>>>>>>>>>>>> 1. corosync, cib - STONITH works 100% of the time.
>>>>>>>>>>>> Killing with any signal triggers STONITH and a reboot.
>>>>>>>>>>>>
>>>>>>>>>>>> 2. lrmd, crmd - strange STONITH behavior.
>>>>>>>>>>>> Sometimes STONITH is called, with the corresponding reaction.
>>>>>>>>>>>> Sometimes the daemon restarts and the resources restart, with a long delay for MS:pgsql.
>>>>>>>>>>>> Once, after crmd restarted, pgsql did not restart.
>>>>>>>>>>>>
>>>>>>>>>>>> 3. stonithd, attrd, pengine - STONITH not needed.
>>>>>>>>>>>> These daemons simply restart; the resources stay running.
>>>>>>>>>>>>
>>>>>>>>>>>> 4. pacemakerd - nothing happens.
>>>>>>>>>>>> After that I can kill any process from the third group, and it does not restart.
>>>>>>>>>>>> Generally, don't touch corosync, cib and maybe lrmd, crmd.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think about this?
>>>>>>>>>>>> The main question of this thread has been resolved.
>>>>>>>>>>>> But this varied behavior is another big problem.
>>>>>>>>>>>>
>>>>>>>>>>>> Forgot the logs: http://send2me.ru/pcmk-Tue-18-Feb-2014.tar.bz2
>>>>>>>>>>> Which of the various conditions above do the logs cover?
>>>>>>>>>> All of them, over the course of the day.
>>>>>>>>> Are you trying to torture me?
>>>>>>>>> Can you give me a rough idea what happened when?
>>>>>>>> No, it's 8 processes times 4 signals, plus repeated experiments with unknown outcomes :)
>>>>>>>> It is easier to run new experiments and collect separate logs for each.
>>>>>>>> Which variant is more interesting?
>>>>>>> The long delay in restarting pgsql.
>>>>>>> Everything else seems correct.
>>>>>> It didn't even try to start pgsql.
>>>>>> The logs contain three tests.
>>>>>> kill -s4 lrmd pid.
>>>>>> 1. STONITH
>>>>>> 2. STONITH
>>>>>> 3. hangs
>>>>> It's waiting on a value for default_ping_set
>>>>>
>>>>> It seems we're calling monitor for pingCheck but for some reason it's not performing an update:
>>>>>
>>>>> # grep 2632.*lrmd.*pingCheck /Users/beekhof/Downloads/pcmk-Wed-19-Feb-2014/dev-cluster2-node2.unix.tensor.ru/corosync.log
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck' not found (3 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_get_rsc_info: Resource 'pingCheck:3' not found (3 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: info: process_lrmd_rsc_register: Added 'pingCheck' to the rsc list (4 active resources)
>>>>> Feb 19 10:49:58 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:19
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658 - exited with rc=0
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stderr [ -- empty -- ]
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_0:2658:stdout [ -- empty -- ]
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_finished: finished - rsc:pingCheck action:monitor call_id:19 pid:2658 exit-code:0 exec-time:2039ms queue-time:0ms
>>>>> Feb 19 10:50:00 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: log_execute: executing - rsc:pingCheck action:monitor call_id:20
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816 - exited with rc=0
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stderr [ -- empty -- ]
>>>>> Feb 19 10:50:02 [2632] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_monitor_10000:2816:stdout [ -- empty -- ]
>>>>>
>>>>> Could you add:
>>>>>
>>>>> export OCF_TRACE_RA=1
>>>>>
>>>>> to the top of the ping agent and retest?
>>>> Today it worked on the fourth try.
>>>> I even wondered whether it makes a difference how it is killed (kill -s 4 pid or pkill -4 lrmd).
>>>> Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
>>> Hi,
>>> Have you looked at it yet?
>>
>> Not yet. I've been hitting ACLs with a large hammer.
>> Where are we up to with this? Do I disregard this one and look at the most recent email?
>>
>
> Hi.
> No. These are two different cases.
> * Resources don't start after lrmd is killed.
> Logs: http://send2me.ru/pcmk-Fri-21-Feb-2014.tar.bz2
Grumble, the logs are still useless...
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456 - exited with rc=0
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stderr [ -- empty -- ]
./dev-cluster2-node2.unix.tensor.ru/corosync.log:Feb 21 11:00:10 [26298] dev-cluster2-node2.unix.tensor.ru lrmd: debug: operation_finished: pingCheck_start_0:26456:stdout [ -- empty -- ]
Can you just add the following to the top of the resource agent?
set -x
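For reference, a minimal sketch of why this helps: `set -x` makes the shell echo each command to stderr, which is the stream lrmd reports as "[ -- empty -- ]" in the log lines above. The function `fake_monitor` and the addresses below are hypothetical stand-ins for illustration only, not code from the real ping agent.

```shell
#!/bin/sh
# Hypothetical stand-in for an agent action; NOT the real ping agent.
fake_monitor() {
    set -x                      # from here on, each command is echoed to stderr
    active=0
    for host in 192.0.2.1 192.0.2.2; do   # placeholder addresses, never contacted
        active=$((active + 1))
    done
    set +x                      # stop tracing
    echo "active=$active"       # normal output still goes to stdout
}

# Capture stderr only (stdout is discarded), i.e. just the trace.
trace=$(fake_monitor 2>&1 >/dev/null)
printf '%s\n' "$trace"
```

With tracing on, the stderr captured for each operation should show every step the agent took, including whether it ever reached the point of updating the attribute.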
>
> * Putting the entire cluster into standby (all nodes standby).
> The second node hangs in pending.
> But with the latest rebuilt rpm the problem was not reproduced.
Ok, so potentially this is fixed with the latest git?
> So for now this can be considered not a problem.
> http://send2me.ru/pcmk-04-Mar-2014-2.tar.bz2
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org