[ClusterLabs] Antw: Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Aug 29 02:49:59 EDT 2017
>>> Ferenc Wágner <wferi at niif.hu> wrote on 28.08.2017 at 18:07 in message
<87mv6jk75r.fsf at lant.ki.iif.hu>:
[...]
cLVM under I/O load can be really slow (I'm talking about delays in the range
of a few seconds). Be sure to adjust any timeouts accordingly. I wrote a
tool that monitors the read latency as seen by applications, so I know these
numbers. And things get significantly worse if you do cLVM mirroring with a
mirrorlog replicated to each device.
Maybe cLVM slows down as n^2, where n is the number of nodes; I don't know
;-)
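
For illustration only, a minimal sketch of such a read-latency probe (not the
actual tool mentioned above, and the device path is just a placeholder) could
look like this, assuming Python 3 on Linux. It times one small read per
second, dropping the page cache for the probed range first so the number
reflects the real storage path rather than cached data:

#!/usr/bin/env python3
# Sketch of a read-latency probe (not the tool referred to above).
# Reads one block from a device/file once per second and prints how long
# the read took, after dropping the page cache for that range so the
# measurement is not satisfied from memory.
import os
import sys
import time

device = sys.argv[1] if len(sys.argv) > 1 else "/dev/vgtest/lvtest"  # placeholder path
block_size = 4096   # probe one page per read
interval = 1.0      # seconds between probes

fd = os.open(device, os.O_RDONLY)
try:
    while True:
        # Discard any cached copy of the probed range (Linux-specific advice).
        os.posix_fadvise(fd, 0, block_size, os.POSIX_FADV_DONTNEED)
        start = time.monotonic()
        os.pread(fd, block_size, 0)
        latency_ms = (time.monotonic() - start) * 1000.0
        print("%s read latency: %.2f ms" % (time.strftime("%H:%M:%S"), latency_ms), flush=True)
        time.sleep(interval)
finally:
    os.close(fd)

Run against a cLVM-backed logical volume (e.g. python3 readlatency.py
/dev/<vg>/<lv>, with a hypothetical script name) while the cluster is under
I/O load; readings in the multi-second range would confirm the kind of delays
described above.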
Regards,
Ulrich
> So Pacemaker does nothing, basically, and I can't see any adverse effect
> to resource management, but DLM seems to have some problem, which may or
> may not be related. When the TOTEM error appears, all nodes log this:
>
> vhbl03 dlm_controld[3914]: 2801675 dlm:controld ring 167773705:3056 6 memb
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 dlm:ls:clvmd ring 167773705:3056 6 memb
> 167773705 167773706 167773707 167773708 167773709 167773710
> vhbl03 dlm_controld[3914]: 2801675 clvmd wait_messages cg 9 need 1 of 6
> vhbl03 dlm_controld[3914]: 2801675 fence work wait for cluster ringid
> vhbl03 dlm_controld[3914]: 2801675 cluster quorum 1 seq 3056 nodes 6
>
> dlm_controld is running with --enable_fencing=0. Pacemaker does its own
> fencing if resource management requires it, but DLM is used by cLVM
> only, which does not warrant such harsh measures. Right now cLVM is
> blocked; I don't know since when, because we seldom do cLVM operations
> on this cluster. My immediate aim is to unblock cLVM somehow.
>
> While dlm_tool status reports (similar on all nodes):
>
> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
> daemon now 2941405 fence_pid 0
> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
>
> dlm_tool ls shows "kern_stop":
>
> dlm lockspaces
> name clvmd
> id 0x4104eefa
> flags 0x00000004 kern_stop
> change member 5 joined 0 remove 1 failed 1 seq 8,8
> members 167773705 167773706 167773707 167773708 167773710
> new change member 6 joined 1 remove 0 failed 0 seq 9,9
> new status wait messages 1
> new members 167773705 167773706 167773707 167773708 167773709 167773710
>
> on all nodes except for vhbl07 (167773709), where it gives
>
> dlm lockspaces
> name clvmd
> id 0x4104eefa
> flags 0x00000000
> change member 6 joined 1 remove 0 failed 0 seq 11,11
> members 167773705 167773706 167773707 167773708 167773709 167773710
>
> instead.
>
> Does anybody have an idea what the problem(s) might be? Why is Corosync
> deteriorating on this cluster? (It's running with RR PRIO 99.) Could
> that have hurt DLM? Is there a way to unblock DLM without rebooting all
> nodes? (Actually, rebooting is problematic in itself with blocked cLVM,
> but that's tractable.)
> --
> Thanks,
> Feri
>