[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
Ferenc Wágner
wferi at niif.hu
Fri Sep 1 07:51:11 EDT 2017
Digimer <lists at alteeve.ca> writes:
> On 2017-08-29 10:45 AM, Ferenc Wágner wrote:
>
>> Digimer <lists at alteeve.ca> writes:
>>
>>> On 2017-08-28 12:07 PM, Ferenc Wágner wrote:
>>>
>>>> [...]
>>>> While dlm_tool status reports (similar on all nodes):
>>>>
>>>> cluster nodeid 167773705 quorate 1 ring seq 3088 3088
>>>> daemon now 2941405 fence_pid 0
>>>> node 167773705 M add 196 rem 0 fail 0 fence 0 at 0 0
>>>> node 167773706 M add 5960 rem 5730 fail 0 fence 0 at 0 0
>>>> node 167773707 M add 2089 rem 1802 fail 0 fence 0 at 0 0
>>>> node 167773708 M add 3646 rem 3413 fail 0 fence 0 at 0 0
>>>> node 167773709 M add 2588921 rem 2588920 fail 0 fence 0 at 0 0
>>>> node 167773710 M add 196 rem 0 fail 0 fence 0 at 0 0
>>>>
>>>> dlm_tool ls shows "kern_stop":
>>>>
>>>> dlm lockspaces
>>>> name clvmd
>>>> id 0x4104eefa
>>>> flags 0x00000004 kern_stop
>>>> change member 5 joined 0 remove 1 failed 1 seq 8,8
>>>> members 167773705 167773706 167773707 167773708 167773710
>>>> new change member 6 joined 1 remove 0 failed 0 seq 9,9
>>>> new status wait messages 1
>>>> new members 167773705 167773706 167773707 167773708 167773709 167773710
>>>>
>>>> on all nodes except for vhbl07 (167773709), where it gives
>>>>
>>>> dlm lockspaces
>>>> name clvmd
>>>> id 0x4104eefa
>>>> flags 0x00000000
>>>> change member 6 joined 1 remove 0 failed 0 seq 11,11
>>>> members 167773705 167773706 167773707 167773708 167773709 167773710
>>>>
>>>> instead.
>>>>
>>>> [...] Is there a way to unblock DLM without rebooting all nodes?
>>>
>>> Looks like the lost node wasn't fenced.
>>
>> Why doesn't dlm status report any lost node then? Or do I misinterpret
>> its output?
>>
>>> Do you have fencing configured and tested? If not, DLM will block
>>> forever because it won't recover until it has been told that the lost
>>> peer has been fenced, by design.
>>
>> What command would you recommend for unblocking DLM in this case?
>
> First, fix fencing. Do you have that setup and working?

I really don't want DLM to do fencing. DLM blocking for a couple of
days is not an issue in this setup (cLVM isn't a "service" of this
cluster, only a rarely needed administration tool). Fencing is set up
and works fine for Pacemaker, so it's used to recover actual HA
services. But letting DLM use it resulted in disaster one and a half
years ago (see Message-ID: <87r3g5a969.fsf at lant.ki.iif.hu>), which I
still haven't understood, and I'd rather not go there again until that's
taken care of properly. So for now, a manual unblock path is all I'm
after.
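
For comparing the lockspace state across nodes, a small script along
these lines might help. It is only a sketch: it assumes the text has
the same layout as the "dlm_tool ls" output quoted above, the function
names (parse_lockspaces, pending_joiners) are made up for illustration,
and it only reads the state; it does not unblock anything.

```python
# Sketch: summarize `dlm_tool ls`-style output. The layout is assumed to
# match the output quoted in this thread; all names here are hypothetical.

SAMPLE = """\
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 5 joined 0 remove 1 failed 1 seq 8,8
members 167773705 167773706 167773707 167773708 167773710
new change member 6 joined 1 remove 0 failed 0 seq 9,9
new status wait messages 1
new members 167773705 167773706 167773707 167773708 167773709 167773710
"""


def parse_lockspaces(text):
    """Return one dict per lockspace found in the given text."""
    spaces = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("name "):
            # A "name" line starts a new lockspace record.
            current = {
                "name": line.split()[1],
                "flags": [],
                "members": set(),
                "new_members": set(),
                "new_status": None,
            }
            spaces.append(current)
        elif current is None:
            continue  # header lines before the first lockspace
        elif line.startswith("flags "):
            # "flags 0x00000004 kern_stop" -> ["kern_stop"]
            current["flags"] = line.split()[2:]
        elif line.startswith("members "):
            current["members"] = {int(n) for n in line.split()[1:]}
        elif line.startswith("new members "):
            current["new_members"] = {int(n) for n in line.split()[2:]}
        elif line.startswith("new status "):
            current["new_status"] = line[len("new status "):]
    return spaces


def pending_joiners(space):
    """Node IDs listed in the pending change but not in the current one."""
    return sorted(space["new_members"] - space["members"])


spaces = parse_lockspaces(SAMPLE)
for space in spaces:
    stopped = "kern_stop" in space["flags"]
    print(space["name"], "kern_stop" if stopped else "running",
          "pending:", pending_joiners(space),
          "status:", space["new_status"])
```

Feeding it the output from each node would make the odd one out
(vhbl07 here) stand out by its missing kern_stop flag and pending
change.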
--
Thanks,
Feri