[Pacemaker] Pacemaker hang with hardware reset

emmanuel segura emi2fast at gmail.com
Wed Jul 4 06:22:55 EDT 2012


Hello Damiano

Are you using DRBD fencing together with Pacemaker fencing?
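
If not, DRBD's fence-peer handler is the usual way to tie DRBD fencing into
Pacemaker. As a rough sketch (the handler scripts are the standard ones
shipped with drbd-utils; the resource name and paths are only placeholders,
adjust them to your setup), the relevant part of drbd.conf would look
something like:

    resource r0 {
      disk {
        fencing resource-and-stonith;
      }
      handlers {
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }

With resource-and-stonith, DRBD freezes I/O on the surviving node and calls
the handler, which puts a location constraint into the CIB until the peer
has been fenced or resynced.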

2012/7/4 Damiano Scaramuzza <cesello at daimonlab.it>

> Hi all, this is my first post on this mailing list.
> I used Heartbeat in 2008 for a big project, and now I'm back with
> Pacemaker for a smaller one.
>
> I have two nodes running drbd/clvm/ocfs2/kvm virtual machines, all on
> Debian wheezy using testing (quite stable) packages.
> The configuration uses the meatware STONITH plugin and some colocation
> rules (I can post the CIB file if needed).
> If I gracefully stop one of the two nodes, everything works fine: the VM
> resources migrate to the other node, DRBD fences, and
> all colocation/start-stop orders are fulfilled.
>
> Bad things happen when I force a reset of one of the two nodes with echo b >
> /proc/sysrq-trigger
>
> Scenario 1) The cluster software hangs completely: crm_mon still reports
> both nodes online, but the other node reboots and comes back
> without corosync/pacemaker running. No STONITH message at all.
>
> Scenario 2) Sometimes I see the meatware STONITH message; I run meatclient
> and the cluster hangs.
> Scenario 3) The meatware message appears, I run meatclient, crm_mon reports
> the node as unclean, but I see some resources stopped and some running or Master.
>
> With the full configuration including ocfs2 (I tested gfs2 too), I see
> these messages in syslog:
>
> kernel: [ 2277.229622] INFO: task virsh:11370 blocked for more than 120
> seconds.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229626] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229629] virsh           D
> ffff88041fc53540     0 11370  11368 0x00000000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229635]  ffff88040b50ce60
> 0000000000000082 0000000000000000 ffff88040f235610
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229642]  0000000000013540
> ffff8803e1953fd8 ffff8803e1953fd8 ffff88040b50ce60
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229648]  0000000000000246
> 0000000181349294 ffff8803f5ca2690 ffff8803f5ca2000
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229655] Call Trace:
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229673]  [<ffffffffa06da2d9>] ?
> ocfs2_wait_for_recovery+0xa2/0xbc [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229679]  [<ffffffff8105f51b>] ?
> add_wait_queue+0x3c/0x3c
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229696]  [<ffffffffa06c8896>] ?
> ocfs2_inode_lock_full_nested+0xeb/0x925 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229714]  [<ffffffffa06cdd2a>] ?
> ocfs2_permission+0x2b/0xe1 [ocfs2]
> Jun 30 05:36:13 hvlinux02 kernel: [ 2277.229721]  [<ffffffff811019e9>] ?
> unlazy_walk+0x100/0x132
>
>
> So, to simplify and rule ocfs2 out of the hang, I tried drbd/clvm only, but
> after resetting one node with the same echo b
> the cluster hangs with these messages in syslog:
>
> kernel: [ 8747.118110] INFO: task clvmd:8514 blocked for more than 120
> seconds.
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118115] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118119] clvmd           D
> ffff88043fc33540     0  8514      1 0x00000000
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118126]  ffff8803e1b35810
> 0000000000000082 ffff880416efbd00 ffff88042f1f40c0
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118134]  0000000000013540
> ffff8803e154bfd8 ffff8803e154bfd8 ffff8803e1b35810
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118140]  ffffffff8127a5fe
> 0000000000000000 0000000000000000 ffff880411b8a698
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118147] Call Trace:
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118157]  [<ffffffff8127a5fe>] ?
> sock_sendmsg+0xc1/0xde
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118165]  [<ffffffff81349227>] ?
> rwsem_down_failed_common+0xe0/0x114
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118172]  [<ffffffff811b1b64>] ?
> call_rwsem_down_read_failed+0x14/0x30
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118177]  [<ffffffff81348bad>] ?
> down_read+0x17/0x19
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118195]  [<ffffffffa0556a44>] ?
> dlm_user_request+0x3a/0x1a9 [dlm]
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118206]  [<ffffffffa055e61b>] ?
> device_write+0x28b/0x616 [dlm]
> Jun 30 04:59:45 hvlinux01 kernel: [ 8747.118214]  [<ffffffff810eb4a9>] ?
> __kmalloc+0x100/0x112
>
> It seems as if dlm or corosync stops communicating, or does not sense
> that the other node is gone,
> and all of the pieces above stay stuck waiting.
>
> Versions:
>   corosync                1.4.2-2
>   dlm-pcmk                3.0.12-3.1
>   gfs-pcmk                3.0.12-3.1
>   ocfs2-tools-pacemaker   1.6.4-1
>   pacemaker               1.1.7-1
>
> Any clue?
>
>
>
>
>
>
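
When it hangs, it would also be useful to check whether corosync still lists
the dead node as a member and whether the DLM is blocked waiting for fencing.
Something along these lines (commands from corosync 1.4 / dlm_controld;
adjust to your setup):

    # is the dead node still in the corosync membership?
    corosync-cfgtool -s
    corosync-objctl runtime.totem.pg.mrp.srp.members

    # are the DLM lockspaces stuck waiting for a fence to complete?
    dlm_tool ls
    dlm_tool dump | tail -n 50

If the dead node never drops out of the membership, the problem is at the
corosync/network level; if it drops out but the lockspaces stay stuck, the
fence result is not reaching dlm_controld.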



-- 
this is my life and I live it as long as God wills