[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Mon Jan 16 07:45:46 CET 2012

16.01.2012 09:20, Andrew Beekhof wrote:
[snip]
>>> At the same time, stonith_admin -B succeeds.
>>> The main difference I see is st_opt_sync_call in the latter case.
>>> Will try to experiment with it.
>>
>> Yeeeesssss!!!
>>
>> Now I see the following:
>> Dec 19 11:53:34 vd01-a cluster-dlm: [2474]: info:
>> pacemaker_terminate_member: Requesting that node 1090782474/vd01-b be fenced
> 
> So the important question... what did you change?

Nice you're back ;)

+ rc = st->cmds->fence(st, *st_opt_sync_call*, node_uname, "reboot", 120);

I'm attaching my resulting version of pacemaker.c. It is still messy
because of the different approaches I tried along the way, and it needs
a cleanup. The function to look at is pacemaker_terminate_member(),
which is an almost one-to-one copy of crm_terminate_member_no_mainloop()
except for a renamed variable (to compile without warnings) and the
changed ->fence() arguments.
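
To make the difference concrete, here is a minimal sketch of why the
option flag can matter for a caller without a mainloop. Everything
prefixed with mock_ is hypothetical and is not the stonith-ng API; only
the shape of the fence() call (handle, options, node, "reboot", 120) is
taken from the line above. The sketch assumes that a call without the
sync flag merely queues the request and relies on a later dispatch step
to process the result, while a call with the flag returns only after
the result is known.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the stonith-ng client types. Only the shape
 * of the fence() call is taken from the mail; names, values and behaviour
 * here are assumptions made for illustration. */
enum mock_call_options {
    mock_opt_none      = 0x0000,
    mock_opt_sync_call = 0x1000   /* arbitrary value for this sketch */
};

typedef struct mock_stonith_s mock_stonith_t;

struct mock_stonith_s {
    int pending;          /* requests sent, result not yet processed */
    int completed;        /* requests whose result has been processed */
    char last_target[64]; /* node named in the most recent request */
    int (*fence)(mock_stonith_t *st, int options, const char *node,
                 const char *action, int timeout);
};

/* Mock fence(): with the sync flag the result is known on return (rc 0);
 * without it the request is only queued and a positive id is returned. */
static int mock_fence(mock_stonith_t *st, int options, const char *node,
                      const char *action, int timeout)
{
    assert(st != NULL && node != NULL && action != NULL);
    (void)timeout; /* the mock never actually waits */

    snprintf(st->last_target, sizeof(st->last_target), "%s", node);

    if (options & mock_opt_sync_call) {
        st->completed++;
        return 0;
    }
    st->pending++;
    return st->pending;
}

static void mock_stonith_init(mock_stonith_t *st)
{
    memset(st, 0, sizeof(*st));
    st->fence = mock_fence;
}

/* Stands in for the mainloop processing replies to queued requests.
 * Returns the number of requests it completed. */
static int mock_dispatch(mock_stonith_t *st)
{
    int handled = st->pending;

    st->completed += st->pending;
    st->pending = 0;
    return handled;
}

/* Same call shape as in pacemaker_terminate_member(); the options
 * argument is the only thing that varies between the two cases. */
static int mock_terminate_member(mock_stonith_t *st, const char *node_uname,
                                 int options)
{
    return st->fence(st, options, node_uname, "reboot", 120);
}
```

In this mock, mock_terminate_member(&st, "vd01-b", mock_opt_none)
leaves the request pending until something calls mock_dispatch(), which
a process without a mainloop never does. With mock_opt_sync_call the
request is counted as completed as soon as the call returns.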

> 
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> vd01-b: 21425fc0-4311-40fa-9647-525c3f258471
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>> vd01-c now has id: 1107559690
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>> Processed st_query from vd01-c: rc=0
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>> vd01-d now has id: 1124336906
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>> Processed st_query from vd01-d: rc=0
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: stonith_command:
>> Processed st_query from vd01-a: rc=0
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>> Requesting that vd01-c perform op reboot vd01-b
>> Dec 19 11:53:34 vd01-a stonith-ng: [1905]: info: crm_get_peer: Node
>> vd01-b now has id: 1090782474
>> ...
>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: stonith_command:
>> Processed st_fence_history from cluster-dlm: rc=0
>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: tengine_stonith_notify: Peer
>> vd01-b was terminated (reboot) by vd01-c for vd01-a
>> (ref=21425fc0-4311-40fa-9647-525c3f258471): OK
>>
>> But then I see a minor issue, that the node is marked to be fenced again:
>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: pe_fence_node: Node vd01-b
>> will be fenced because it is un-expectedly down
> 
> Do you have logs for that?
> tengine_stonith_notify() got called, that should have been enough to
> get the node cleaned up in the cib.

Ugh, it seems so, but they are already archived. I will copy them back
to the nodes and try to compose an hb_report from them (but the pe inputs
are already lost; do you still need the logs without them?)

> 
>> ...
>> Dec 19 11:53:40 vd01-a pengine: [1909]: WARN: stage6: Scheduling Node
>> vd01-b for STONITH
>> ...
>> Dec 19 11:53:40 vd01-a crmd: [1910]: info: te_fence_node: Executing
>> reboot fencing operation (249) on vd01-b (timeout=60000)
>> ...
>> Dec 19 11:53:40 vd01-a stonith-ng: [1905]: info: call_remote_stonith:
>> Requesting that vd01-c perform op reboot vd01-b
>>
>> And so on.
>>
>> I can't investigate this one in more depth, because I use fence_xvm in
>> this testing cluster, and it has issues when running more than one
>> stonith resource on a node. Also, my RA (in the cluster where this testing
>> cluster runs) undefines the VM after a failure, so fence_xvm does not see
>> the fencing victim in qpid and is unable to fence it again.
>>
>> Maybe it is possible to check whether the node was just fenced and skip
>> the redundant fencing?
> 
> If the callbacks are being used correctly, it shouldn't be required
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker.c
Type: text/x-csrc
Size: 10265 bytes
Desc: not available
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20120116/5198804c/attachment-0001.bin>