[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Thu Dec 1 09:32:37 EST 2011


Hi Andrew,

I investigated on my test cluster what actually happens with dlm and
fencing.

I added more debug messages to the dlm dump, and also re-kicked nodes
after some time.

The result is that the stonith history contains no information at all
until pacemaker itself decides to fence the node.

The test case I used is: killall -9 dlm_controld.pcmk on one node.
After that I see in the dlm dump:
1322748122 dlm:controld conf 3 0 1 memb 1074005258 1090782474 1124336906
join left 1107559690
1322748122 dlm:ls:clvmd conf 3 0 1 memb 1074005258 1090782474 1124336906
join left 1107559690
1322748122 clvmd add_change cg 7 remove nodeid 1107559690 reason 5
1322748122 Requested that node 1107559690 be kicked from the cluster
1322748122 clvmd add_change cg 7 counts member 3 joined 0 remove 1 failed 1
1322748122 clvmd stop_kernel cg 7
1322748122 write "0" to "/sys/kernel/dlm/clvmd/control"
1322748122 It does not appear node 1107559690/vd01-c has been shot
1322748122 clvmd check_fencing 1107559690 wait add 1322748073 fail
1322748122 last 0
1322748122 It does not appear node 1107559690/vd01-c has been shot
1322748123 It does not appear node 1107559690/vd01-c has been shot
...
1322748133 It does not appear node 1107559690/vd01-c has been shot
1322748133 Requested that node 1107559690 be kicked from the cluster
1322748134 It does not appear node 1107559690/vd01-c has been shot
...
1322748276 It does not appear node 1107559690/vd01-c has been shot
1322748276 Requested that node 1107559690 be kicked from the cluster
1322748277 It does not appear node 1107559690/vd01-c has been shot
1322748278 It does not appear node 1107559690/vd01-c has been shot
1322748279 It does not appear node 1107559690/vd01-c has been shot
1322748280 It does not appear node 1107559690/vd01-c has been shot
1322748281 It does not appear node 1107559690/vd01-c has been shot
1322748282 It does not appear node 1107559690/vd01-c has been shot
1322748283 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748284 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748285 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748286 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748287 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748288 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748289 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748290 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748291 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748292 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748293 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748294 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748295 Stonith history[0]: Fencing of node 1107559690/vd01-c is in
progress
1322748296 Processing membership 488
1322748296 Skipped active node 1124336906: born-on=476, last-seen=488,
this-event=488, last-event=484
1322748296 Skipped active node 1074005258: born-on=484, last-seen=488,
this-event=488, last-event=484
1322748296 Skipped active node 1090782474: born-on=464, last-seen=488,
this-event=488, last-event=484
1322748296 del_configfs_node rmdir
"/sys/kernel/config/dlm/cluster/comms/1107559690"
1322748296 Removed inactive node 1107559690: born-on=468, last-seen=484,
this-event=488, last-event=484
1322748296 Stonith history[0]: Node 1107559690/vd01-c fenced at 1322748296
1322748296 Node 1107559690/vd01-c was last shot at: 1322748296
1322748296 clvmd check_fencing 1107559690 done add 1322748073 fail
1322748122 last 1322748296
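
The delays in a dump like the one above can be read off with a short
script. This is only a minimal sketch: the helper name first_timestamp
and the choice of sample lines are my own, and the wrapped dump lines
are rejoined by hand before parsing.

```python
import re

# A few lines copied from the dump above (wrapped lines rejoined).
DUMP = """\
1322748122 Requested that node 1107559690 be kicked from the cluster
1322748122 It does not appear node 1107559690/vd01-c has been shot
1322748283 Stonith history[0]: Fencing of node 1107559690/vd01-c is in progress
1322748296 Stonith history[0]: Node 1107559690/vd01-c fenced at 1322748296
"""


def first_timestamp(dump, pattern):
    """Return the timestamp of the first line whose message matches pattern."""
    for line in dump.splitlines():
        m = re.match(r"(\d+) (.*)", line)
        if m and re.search(pattern, m.group(2)):
            return int(m.group(1))
    return None  # no matching line in this dump


kick = first_timestamp(DUMP, r"^Requested that node \d+ be kicked")
history = first_timestamp(DUMP, r"^Stonith history\[\d+\]")
fenced = first_timestamp(DUMP, r"fenced at \d+")

print("kick -> first stonith history entry:", history - kick, "s")
print("kick -> node fenced:", fenced - kick, "s")
```

With the sample lines this reports 161 s until the first stonith
history entry and 174 s until the node is reported as fenced.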

So the first stonith history entry appeared only 161 seconds after the
initial fencing attempt (1322748283 - 1322748122).
That corresponds to the following log lines (1322748283 = Dec 01 2011
14:04:43 UTC):
Dec  1 14:04:42 vd01-b pengine: [1894]: WARN: stage6: Scheduling Node
vd01-c for STONITH
Dec  1 14:04:42 vd01-b pengine: [1894]: WARN: native_stop_constraints:
Stop of failed resource dlm:2 is implicit after vd01-c is fenced
Dec  1 14:04:42 vd01-b pengine: [1894]: WARN: native_stop_constraints:
Stop of failed resource clvmd:2 is implicit after vd01-c is fenced
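
To match dlm dump timestamps against syslog, the epoch value can be
converted to UTC. A small sketch; the function name epoch_to_utc is my
own, and it assumes the syslog timestamps above are in UTC, as the
comparison implies.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Format a Unix timestamp (seconds) as a UTC date and time string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


# First "Stonith history[0]" entry in the dlm dump.
print(epoch_to_utc(1322748283))  # 2011-12-01 14:04:43
```

That is one second after the pengine lines at 14:04:42.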

From my PoV, that means that the call to
crm_terminate_member_no_mainloop() does not actually schedule a fencing
operation.

Best,
Vladislav




