[Pacemaker] [Partially SOLVED] pacemaker/dlm problems
Andrew Beekhof
andrew at beekhof.net
Mon Jan 16 07:17:28 CET 2012
Sorry for not getting to this earlier...
On Mon, Dec 19, 2011 at 10:39 PM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> 09.12.2011 08:44, Andrew Beekhof wrote:
>> On Fri, Dec 9, 2011 at 3:16 PM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>> 09.12.2011 03:11, Andrew Beekhof wrote:
>>>> On Fri, Dec 2, 2011 at 1:32 AM, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>>>> Hi Andrew,
>>>>>
>>>>> I investigated on my test cluster what actually happens with dlm and
>>>>> fencing.
>>>>>
>>>>> I added more debug messages to dlm dump, and also did a re-kick of nodes
>>>>> after some time.
>>>>>
>>>>> Results are that stonith history actually doesn't contain any
>>>>> information until pacemaker decides to fence node itself.
>>>>
>>>> ...
>>>>
>>>>> From my PoV that means that the call to
>>>>> crm_terminate_member_no_mainloop() does not actually schedule fencing
>>>>> operation.
>>>>
>>>> You're going to have to remind me... what does your copy of
>>>> crm_terminate_member_no_mainloop() look like?
>>>> This is with the non-cman editions of the controlds too right?
>>>
>>> Just latest github's version. You changed some dlm_controld.pcmk
>>> functionality, so it asks stonithd for fencing results instead of XML
>>> magic. But call to crm_terminate_member_no_mainloop() remains the same
>>> there. But yes, that version communicates stonithd directly too.
>>>
>>> So, the problem here is just with crm_terminate_member_no_mainloop(),
>>> which for some reason skips the actual fencing request.
>>
>> There should be some logs, either indicating that it tried, or that it failed.
>
> Nothing about fencing.
> Only messages about history requests:
>
> stonith-ng: [1905]: info: stonith_command: Processed st_fence_history
> from cluster-dlm: rc=0
The logs would be from the dlm, since that's what is calling
crm_terminate_member_no_mainloop().
>
> I even moved all fencing code to dlm_controld to have better control over
> what it does (and to avoid rebuilding pacemaker to play with that code).
> dlm_tool dump prints the same line every second, stonith-ng prints
> history requests.
>
> Somewhat oddly, I once saw a fencing request from cluster-dlm
> succeed, but only right after the node had been fenced by
> pacemaker. As a result, the node was switched off instead of rebooted.
>
> That raises one more question: is it correct to call st->cmds->fence()
> with third parameter set to "off"?
> I think that "reboot" is more consistent with the rest of fencing subsystem.
Either is legitimate.
>
> At the same time, stonith_admin -B succeeds.
> The main difference I see is st_opt_sync_call in the latter case.
> Will try to experiment with it.
/Shouldn't/ matter.