[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Tue Jan 17 12:34:24 CET 2012

17.01.2012 07:27, Andrew Beekhof wrote:
> On Tue, Jan 17, 2012 at 3:04 PM, Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>> 17.01.2012 04:01, Andrew Beekhof wrote:
>>> On Mon, Jan 16, 2012 at 5:45 PM, Vladislav Bogdanov
>>> <bubble at hoster-ok.com> wrote:
>>>> 16.01.2012 09:20, Andrew Beekhof wrote:
>>>> [snip]
>>>>>>> At the same time, stonith_admin -B succeeds.
>>>>>>> The main difference I see is st_opt_sync_call in the latter case.
>>>>>>> Will try to experiment with it.
>>>>>>
>>>>>> Yeeeesssss!!!
>>>>>>
>>>>>> Now I see the following:
>>>>>> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info:
>>>>>> pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
>>>>>
>>>>> So the important question... what did you change?
>>>>
>>>> Nice you're back ;)
>>>>
>>>> + rc = st->cmds->fence(st, *st_opt_sync_call*, node_uname, "reboot", 120);
>>>
>>> Really struggling to see how changing anything here can impact whether
>>> the log message /before/ it gets printed.
>>
>> Did I say it? ;)
> 
> Sorry, I pattern matched the pacemaker_terminate_member and thought it
> came from my original function.
> At a loss to explain why your code logs but pacemaker's doesn't.

That was quite a while ago, so I cannot remember whether it is actually
true; probably not. That message should appear now that you have fixed
crm_terminate_member_common() to work outside of the cluster. (That was a
separate issue: crm_terminate_member_common() was called, but failed to
go on and actually request fencing.) This thread is so long that I can't
find where that was established anymore...

stonith_api_cs_kick() looks good, though I haven't tried it yet.
Should I also try to use stonith_api_cs_time() in fence_node_time() and
fence_in_progress()?
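
To make the question concrete, here is a rough sketch of how both
functions could sit on top of such a time lookup. The real
stonith_api_cs_time() signature is not shown in this thread, so the
fence_time_fn type below is a hypothetical stand-in for it. The sketch_*
names, the node-list argument and the return conventions are assumptions
as well, not dlm_controld's actual interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* HYPOTHETICAL stand-in for stonith_api_cs_time(). Assumed contract:
 * with in_progress == false it returns the time of the last completed
 * fencing of nodeid; with in_progress == true it returns the start time
 * of a pending fencing operation; it returns 0 when there is none. */
typedef time_t (*fence_time_fn)(int nodeid, bool in_progress);

/* Report when nodeid was last fenced.
 * Returns 0 on success, -1 if no completed fencing is on record. */
static int sketch_fence_node_time(fence_time_fn lookup, int nodeid,
                                  uint64_t *last_fenced_time)
{
    time_t when;

    assert(lookup != NULL && last_fenced_time != NULL);

    when = lookup(nodeid, false);
    if (when <= 0) {
        *last_fenced_time = 0;
        return -1;
    }
    *last_fenced_time = (uint64_t) when;
    return 0;
}

/* Count how many of the given nodes have a fencing operation pending.
 * Always returns 0; the result is stored in *count. */
static int sketch_fence_in_progress(fence_time_fn lookup, const int *nodeids,
                                    size_t n_nodes, int *count)
{
    size_t i;
    int pending = 0;

    assert(lookup != NULL && count != NULL);
    assert(nodeids != NULL || n_nodes == 0);

    for (i = 0; i < n_nodes; i++) {
        if (lookup(nodeids[i], true) > 0) {
            pending++;
        }
    }
    *count = pending;
    return 0;
}

/* Canned lookup for trying the wrappers without a cluster. The node ids
 * are the ones from the logs (vd01-b, vd01-c); the timestamps are
 * arbitrary placeholders. */
static time_t fake_lookup(int nodeid, bool in_progress)
{
    if (nodeid == 1090782474) {
        return in_progress ? 0 : 1000;   /* fenced earlier, nothing pending */
    }
    if (nodeid == 1107559690) {
        return in_progress ? 2000 : 0;   /* never fenced, one op pending */
    }
    return 0;
}
```

With the canned lookup, sketch_fence_node_time(fake_lookup, 1090782474,
&when) returns 0 and sets when to 1000. Passing the lookup in as a
function pointer is only there so the wrappers can be exercised without
stonith-ng running; in pacemaker.c the real helper would be called
directly.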

Anyway, I've added it to my TODO list, right after the
lustre+drbd+pacemaker+booth experiments I'm running right now.

> 
>>
>> The line of interest here is not
>>
>> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info:
>> pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
>>
>> which I added in that function, but the next one:
>>
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
>>
>> which indicates that fencing is fired (and the rest).
>>
>>>
>>>>
>>>> I'm attaching my resulting version of pacemaker.c (which is still quite
>>>> messy because of the different approaches I tried, and needs a cleanup).
>>>> The function to look at is pacemaker_terminate_member(), which is an
>>>> almost one-to-one copy of crm_terminate_member_no_mainloop() except for
>>>> a renamed variable (to compile without warnings) and the changed
>>>> ->fence() arguments.
>>>>
>>>>>
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info:
>>>>>> initiate_remote_stonith_op: Initiating remote operation reboot for
>>>>>> vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>>>>> vd01-c now has id: 1107559690
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>>>>> Processed st_query from vd01-c: rc=0
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>>>>> vd01-d now has id: 1124336906
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>>>>> Processed st_query from vd01-d: rc=0
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>>>>>> Processed st_query from vd01-a: rc=0
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>>>>>> Requesting that vd01-c perform op reboot vd01-b
>>>>>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>>>>>> vd01-b now has id: 1090782474
>>>>>> ...
>>>>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command:
>>>>>> Processed st_fence_history from cluster-dlm: rc=0
>>>>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer
>>>>>> vd01-b was terminated (reboot) by vd01-c for vd01-a
>>>>>> (ref=21425fc0-4311-40fa-9647-525c3f258471): OK
>>>>>>
>>>>>> But then I see a minor issue: the node is marked to be fenced again:
>>>>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b
>>>>>> will be fenced because it is un-expectedly down
>>>>>
>>>>> Do you have logs for that?
>>>>> tengine_stonith_notify() got called, that should have been enough to
>>>>> get the node cleaned up in the cib.
>>>>
>>>> Ugh, seems like yes, but they are archived already. Will get them back
>>>> to nodes and try to compose hb_report for them (but pe inputs are
>>>> already lost, do you still need logs without them?)
>>>>
>>>>>
>>>>>> ...
>>>>>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node
>>>>>> vd01-b for STONITH
>>>>>> ...
>>>>>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing
>>>>>> reboot fencing operation (249) on vd01-b (timeout=60000)
>>>>>> ...
>>>>>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>>>>>> Requesting that vd01-c perform op reboot vd01-b
>>>>>>
>>>>>> And so on.
>>>>>>
>>>>>> I can't investigate this one in more depth, because I use fence_xvm in
>>>>>> this testing cluster, and it has issues when running more than one
>>>>>> stonith resource on a node. Also, my RA (in the cluster where this testing
>>>>>> cluster runs) undefines the VM after a failure, so fence_xvm does not see
>>>>>> the fencing victim in qpid and is unable to fence it again.
>>>>>>
>>>>>> Maybe it is possible to check whether the node was just fenced and skip
>>>>>> the redundant fencing?
>>>>>
>>>>> If the callbacks are being used correctly, it shouldn't be required
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://bugs.clusterlabs.org