[Pacemaker] pcmk_shutdown: Still waiting for crmd
Erik Schwalbe
erik.schwalbe at canoo.com
Mon Dec 12 23:52:08 UTC 2011
Hi, sorry for my late answer.
I activated debug mode; this is the output while shutting down the node:
Dec 13 00:45:56 san1 kernel: [ 361.100051] INFO: task clvmd:3291 blocked for more than 120 seconds.
Dec 13 00:45:56 san1 kernel: [ 361.100405] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 13 00:45:56 san1 kernel: [ 361.100758] clvmd D ffff88043e51b170 0 3291 1 0x00000000
Dec 13 00:45:56 san1 kernel: [ 361.100763] ffff88043e51b170 0000000000000082 000000033e1b2118 ffff88000fa157e8
Dec 13 00:45:56 san1 kernel: [ 361.100766] ffff88043d7b0e58 ffff88000fa15780 000000000000f9e0 ffff88042469dfd8
Dec 13 00:45:56 san1 kernel: [ 361.100770] 0000000000015780 0000000000015780 ffff88043d7b0e20 ffff88043d7b1118
Dec 13 00:45:56 san1 kernel: [ 361.100773] Call Trace:
Dec 13 00:45:56 san1 kernel: [ 361.100784] [<ffffffff810482ed>] ? finish_task_switch+0x3a/0xaf
Dec 13 00:45:56 san1 kernel: [ 361.100790] [<ffffffff812fb73d>] ? schedule_timeout+0x2e/0xdd
Dec 13 00:45:56 san1 kernel: [ 361.100797] [<ffffffff810e5d79>] ? virt_to_head_page+0x9/0x2a
Dec 13 00:45:56 san1 kernel: [ 361.100800] [<ffffffff812fb5f4>] ? wait_for_common+0xde/0x15b
Dec 13 00:45:56 san1 kernel: [ 361.100803] [<ffffffff8104a4cc>] ? default_wake_function+0x0/0x9
Dec 13 00:45:56 san1 kernel: [ 361.100818] [<ffffffffa020096c>] ? dlm_new_lockspace+0x783/0x8da [dlm]
Dec 13 00:45:56 san1 kernel: [ 361.100822] [<ffffffff810e81a5>] ? __kmalloc+0x12f/0x141
Dec 13 00:45:56 san1 kernel: [ 361.100829] [<ffffffffa0206c67>] ? device_write+0x420/0x5f2 [dlm]
Dec 13 00:45:56 san1 kernel: [ 361.100834] [<ffffffff810cd128>] ? handle_mm_fault+0x7aa/0x80f
Dec 13 00:45:56 san1 kernel: [ 361.100839] [<ffffffff810ef544>] ? vfs_write+0xa9/0x102
Dec 13 00:45:56 san1 kernel: [ 361.100842] [<ffffffff810ef659>] ? sys_write+0x45/0x6e
Dec 13 00:45:56 san1 kernel: [ 361.100845] [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b
Dec 13 00:46:06 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4716)
Dec 13 00:46:06 san1 lrmd: [4716]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:06 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4716 exited with return code 0
Dec 13 00:46:16 san1 corosync[2571]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=2600, seq=6) to terminate...
Dec 13 00:46:16 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4724)
Dec 13 00:46:16 san1 lrmd: [4724]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:16 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4724 exited with return code 0
Dec 13 00:46:26 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4732)
Dec 13 00:46:26 san1 lrmd: [4732]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:26 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4732 exited with return code 0
Dec 13 00:46:36 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4740)
Dec 13 00:46:36 san1 lrmd: [4740]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:36 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4740 exited with return code 0
Dec 13 00:46:46 san1 corosync[2571]: [pcmk ] notice: pcmk_shutdown: Still waiting for crmd (pid=2600, seq=6) to terminate...
Dec 13 00:46:46 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4748)
Dec 13 00:46:46 san1 lrmd: [4748]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:46 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4748 exited with return code 0
Dec 13 00:46:56 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4759)
Dec 13 00:46:56 san1 lrmd: [4759]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:46:56 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4759 exited with return code 0
Dec 13 00:47:06 san1 lrmd: [2597]: debug: rsc:dlm:1 monitor[12] (pid 4767)
Dec 13 00:47:06 san1 lrmd: [4767]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Dec 13 00:47:06 san1 lrmd: [2597]: info: operation monitor[12] on dlm:1 for client 2600: pid 4767 exited with return code 0
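The trace shows clvmd blocked inside dlm_new_lockspace. For reference, a sketch of a few diagnostic commands that may help narrow this down (the PID is taken from the hung-task message above; the commands are only printed here, since they have to be run as root on the affected node):

```shell
#!/bin/sh
# Sketch: inspecting a clvmd process that is blocked in dlm_new_lockspace.
# PID 3291 comes from the hung-task trace above; substitute your own.
PID=3291
DLM_CMD="dlm_tool ls"
STACK_CMD="cat /proc/${PID}/stack"
# Print the commands rather than running them; execute them on the live node:
echo "$DLM_CMD        # list DLM lockspaces and their state"
echo "$STACK_CMD      # current kernel stack of the blocked clvmd task"
```

`dlm_tool ls` should show whether the lockspace join ever completed; the kernel stack confirms whether the task is still parked where the hung-task trace says it is.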
Regards,
Erik
----- Original Message -----
From: "Andrew Beekhof" <andrew at beekhof.net>
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
Sent: Thursday, December 8, 2011 04:15:46
Subject: Re: [Pacemaker] pcmk_shutdown: Still waiting for crmd
On Wed, Dec 7, 2011 at 9:44 PM, Erik Schwalbe <erik.schwalbe at canoo.com> wrote:
> Hi,
>
> I think there is never an attempt to demote it.
> kern.log:
>
> 11:35:58 ubuntu1 kernel: Kernel logging (proc) stopped.
>
> corosync.log:
> 11:35:59 ubuntu1 crmd: [839]: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> 11:35:59 ubuntu1 crmd: [839]: info: crm_shutdown: Requesting shutdown
> 11:35:59 ubuntu1 crmd: [839]: notice: crm_shutdown: Forcing shutdown in: 1200000ms
>
> After that, there are no further entries in kern.log.
Sounds strange.
Can you file a bug and attach a crm_report dating from the time you
initiated shutdown please?
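A sketch of such a crm_report invocation (the start time and destination path below are assumed placeholders, not taken from the thread; the command is only printed here and should be run on the affected node):

```shell
#!/bin/sh
# Sketch: collecting a crm_report starting shortly before the shutdown attempt.
# FROM and DEST are placeholder values; adjust them to your own timeline.
FROM="2011-12-13 00:40:00"
DEST="/tmp/pcmk-shutdown-hang"
REPORT_CMD="crm_report -f \"${FROM}\" ${DEST}"
# Printed rather than executed; crm_report bundles logs, the CIB and PE inputs:
echo "$REPORT_CMD"
```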
>
> Regards,
> Erik
>
>
> ----- Original Message -----
> From: "Andreas Kurz" <andreas at hastexo.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Wednesday, December 7, 2011 11:06:14
> Subject: Re: [Pacemaker] pcmk_shutdown: Still waiting for crmd
>
> On 12/07/2011 10:27 AM, Erik Schwalbe wrote:
>> Hi,
>>
>> I built a test cluster with 2 nodes.
>> Ubuntu 10.04.3 LTS with *ppa:ubuntu-ha-maintainers/ppa*
>>
>> corosync 1.4.2
>> pacemaker 1.1.6
>>
>> primitive clvm ocf:lvm2:clvmd \
>> params daemon_timeout="30" \
>> operations $id="clvm-operations" \
>> op start interval="0" timeout="90" \
>> op stop interval="0" timeout="100" \
>> op monitor interval="0" timeout="20" start-delay="0" \
>> meta target-role="started"
>> primitive data ocf:heartbeat:LVM \
>> params volgrpname="data" \
>> operations $id="data-operations" \
>> op start interval="0" timeout="30" \
>> op stop interval="0" timeout="30" \
>> op monitor interval="10" timeout="120" start-delay="0" \
>> op methods interval="0" timeout="5" \
>> meta target-role="started"
>> primitive dlm ocf:pacemaker:controld \
>> operations $id="dlm-operations" \
>> op start interval="0" timeout="90" \
>> op stop interval="0" timeout="100" \
>> op monitor interval="10" timeout="20" start-delay="0" \
>> meta target-role="started"
>> primitive fs ocf:heartbeat:Filesystem \
>> params device="/dev/data/test" directory="/data/test"
>> fstype="ocfs2" \
>> operations $id="fs-operations" \
>> op start interval="0" timeout="60" \
>> op stop interval="0" timeout="60" \
>> op monitor interval="120" timeout="40" start-delay="0" \
>> op notify interval="0" timeout="60" \
>> meta target-role="started"
>> primitive o2cb ocf:pacemaker:o2cb \
>> operations $id="o2cb-operations" \
>> op start interval="0" timeout="90" \
>> op stop interval="0" timeout="100" \
>> op monitor interval="0" timeout="20" start-delay="0" \
>> meta target-role="started"
>> primitive res_DRBD ocf:linbit:drbd \
>> params drbd_resource="r0" \
>> operations $id="res_DRBD-operations" \
>> op start interval="0" timeout="240" \
>> op promote interval="0" timeout="90" \
>> op demote interval="0" timeout="90" \
>> op stop interval="0" timeout="100" \
>> op monitor interval="30" timeout="20" start-delay="1min" \
>> op notify interval="0" timeout="90" \
>> meta target-role="started"
>> group dlm-clvm dlm clvm
>> ms ms_DRBD res_DRBD \
>> meta master-max="2" clone-max="2" notify="true" interleave="true"
>> clone clone_data data \
>> meta clone-max="2" ordered="true" interleave="true"
>> clone dlm-clvm-clone dlm-clvm \
>> meta interleave="true" ordered="true"
>> clone fs-clone fs \
>> meta clone-max="2" ordered="true" interleave="true"
>> clone o2cb-clone o2cb \
>> meta clone-max="2" interleave="true"
>> colocation col_data_clvm-dlm-clone inf: clone_data dlm-clvm-clone
>> colocation col_fs_o2cb inf: fs-clone o2cb-clone
>> colocation col_ms_DRBD_dlm-clvm-clone inf: dlm-clvm-clone ms_DRBD:Master
>> colocation col_o2cb_dlm-clvm inf: o2cb-clone dlm-clvm-clone
>> order ord_data_after_clvm-dlm-clone inf: dlm-clvm-clone clone_data
>> order ord_ms_DRBD_dlm-clvm-clone inf: ms_DRBD:promote dlm-clvm-clone:start
>> order ord_o2cb_after_dlm-clvm 0: dlm-clvm-clone o2cb-clone
>> order ord_o2cb_fs inf: o2cb-clone fs-clone
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>> cluster-infrastructure="openais" \
>> expected-quorum-votes="2" \
>> stonith-enabled="false" \
>> no-quorum-policy="ignore" \
>> last-lrm-refresh="1323246238" \
>> default-resource-stickiness="1000"
>>
>> The problem occurs when restarting corosync or rebooting a cluster
>> node: all resources stop except the DRBD resource, and the system
>> then hangs for a long time.
>
> Is there a timeout on stopping/demoting DRBD, and do you see kernel
> messages from DRBD about being unable to demote because it is in use
> ... or is there never an attempt to demote it?
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>> corosync.log:
>>
>> ubuntu0 crmd: [926]: info: do_state_transition: (Re)Issuing shutdown
>> request now that we are the DC
>> ubuntu0 crmd: [926]: info: do_state_transition: Starting PEngine Recheck
>> Timer
>> ubuntu0 crmd: [926]: info: do_shutdown_req: Sending shutdown request to
>> DC: ubuntu0
>> ubuntu0 crmd: [926]: info: handle_shutdown_request: Creating shutdown
>> request for ubuntu0 (state=S_IDLE)
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd
>> (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd
>> (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd
>> (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd
>> (pid=926, seq=6) to terminate...
>> corosync [pcmk ] notice: pcmk_shutdown: Still waiting for crmd
>> (pid=926, seq=6) to terminate...
>>
>> I tested the same config with Debian 6.0.3 and the reboot works. There,
>> the DRBD resource first demotes to secondary and then goes down.
>>
>> Is this a known problem?
>>
>> Thank you for your help.
>>
>> Regards,
>> Erik
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>