[Pacemaker] [Partially SOLVED] pacemaker/dlm problems
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Dec 19 12:11:50 UTC 2011
19.12.2011 14:39, Vladislav Bogdanov wrote:
> 09.12.2011 08:44, Andrew Beekhof wrote:
>> On Fri, Dec 9, 2011 at 3:16 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>> 09.12.2011 03:11, Andrew Beekhof wrote:
>>>> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> I investigated on my test cluster what actually happens with dlm and
>>>>> fencing.
>>>>>
>>>>> I added more debug messages to dlm dump, and also did a re-kick of nodes
>>>>> after some time.
>>>>>
>>>>> Results are that stonith history actually doesn't contain any
>>>>> information until pacemaker decides to fence node itself.
>>>>
>>>> ...
>>>>
>>>>> From my PoV that means that the call to
>>>>> crm_terminate_member_no_mainloop() does not actually schedule fencing
>>>>> operation.
>>>>
>>>> You're going to have to remind me... what does your copy of
>>>> crm_terminate_member_no_mainloop() look like?
>>>> This is with the non-cman editions of the controlds too right?
>>>
>>> Just the latest version from github. You changed some dlm_controld.pcmk
>>> functionality so that it asks stonithd for fencing results instead of
>>> the XML magic, but the call to crm_terminate_member_no_mainloop()
>>> remains the same there. But yes, that version communicates with
>>> stonithd directly too.
>>>
>>> So, the problem here is just with crm_terminate_member_no_mainloop(),
>>> which for some reason skips the actual fencing request.
>>
>> There should be some logs, either indicating that it tried, or that it failed.
>
> Nothing about fencing.
> Only messages about history requests:
>
> stonith-ng: [1905]: info: stonith_command: Processed st_fence_history
> from cluster-dlm: rc=0
>
> I even moved all the fencing code into dlm_controld to have better
> control over what it does (and to avoid rebuilding pacemaker to play
> with that code).
> dlm_tool dump prints the same line every second, and stonith-ng prints
> the history requests.
>
> A little bit odd, but I once saw a fencing request from cluster-dlm
> succeed, though only right after the node had been fenced by pacemaker.
> As a result, the node was switched off instead of rebooted.
>
> That raises one more question: is it correct to call st->cmds->fence()
> with the third parameter set to "off"?
> I think "reboot" is more consistent with the rest of the fencing subsystem.
>
> At the same time, stonith_admin -B succeeds.
> The main difference I see is st_opt_sync_call in the latter case.
> I will try to experiment with it.
Yeeeesssss!!!
Now I see the following:
Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info:
pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info:
initiate_remote_stonith_op: Initiating remote operation reboot for
vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
vd01-c now has id: 1107559690
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
Processed st_query from vd01-c: rc=0
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
vd01-d now has id: 1124336906
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
Processed st_query from vd01-d: rc=0
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
Processed st_query from vd01-a: rc=0
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
Requesting that vd01-c perform op reboot vd01-b
Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
vd01-b now has id: 1090782474
...
Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command:
Processed st_fence_history from cluster-dlm: rc=0
Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer
vd01-b was terminated (reboot) by vd01-c for vd01-a
(ref=21425fc0-4311-40fa-9647-525c3f258471): OK
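For reference, what cluster-dlm calls now is roughly the following, i.e. a
synchronous request with "reboot" instead of "off". This is only a sketch;
the exact fence() signature, constants and return codes may differ between
pacemaker versions:

    #include <crm/stonith-ng.h>

    /* Sketch: synchronous "reboot" request, similar to what
     * stonith_admin -B does. Signatures may vary between versions. */
    static int request_node_reboot(const char *target)
    {
        int rc;
        stonith_t *st = stonith_api_new();

        if (st == NULL) {
            return -1;
        }

        rc = st->cmds->connect(st, "cluster-dlm", NULL);
        if (rc == stonith_ok) {
            /* st_opt_sync_call makes the call block until stonithd
             * reports the result; "reboot" instead of "off";
             * 120s timeout (assumed value). */
            rc = st->cmds->fence(st, st_opt_sync_call, target, "reboot", 120);
            st->cmds->disconnect(st);
        }

        stonith_api_delete(st);
        return rc;
    }
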
But then I see a minor issue: the node is marked to be fenced again:
Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b
will be fenced because it is un-expectedly down
...
Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node
vd01-b for STONITH
...
Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing
reboot fencing operation (249) on vd01-b (timeout=60000)
...
Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
Requesting that vd01-c perform op reboot vd01-b
And so on.
I can't investigate this one in more depth, because I use fence_xvm in
this testing cluster, and it has issues when running more than one
stonith resource on a node. Also, my RA (in the cluster where this testing
cluster runs) undefines the VM after a failure, so fence_xvm does not see
the fencing victim in qpid and is unable to fence it again.
Maybe it is possible to check whether the node was just fenced and skip the
redundant fencing?
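Something along these lines, perhaps. Again just a sketch: the
stonith_history_t field names and the st_done constant are taken from the
stonith-ng.h I have here and may differ in other versions:

    #include <time.h>
    #include <crm/stonith-ng.h>

    /* Sketch: return non-zero if stonithd's history shows a successful
     * fencing of 'target' that completed within the last 'window' seconds.
     * Field names (state, completed, next) are assumptions based on my
     * copy of stonith-ng.h. */
    static int recently_fenced(stonith_t *st, const char *target, time_t window)
    {
        int found = 0;
        stonith_history_t *history = NULL;
        stonith_history_t *hp = NULL;

        if (st->cmds->history(st, st_opt_sync_call, target,
                              &history, 120) != stonith_ok) {
            return 0;
        }

        for (hp = history; hp != NULL; hp = hp->next) {
            if (hp->state == st_done
                && hp->completed + window >= time(NULL)) {
                found = 1;
                break;
            }
        }

        return found;
    }
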
Vladislav