[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Wed Nov 23 23:58:21 EST 2011


24.11.2011 07:33, Andrew Beekhof wrote:
> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
> <bubble at hoster-ok.com> wrote:
>> Hi Andrew,
>>
>> I just found another problem with dlm_controld.pcmk (with your latest
>> patch from github applied and also my fixes to actually build it - they
>> are included in a message referenced by this one).
>> One node which had just requested fencing of another one gets stuck printing
>> the message where you print ctime() in fence_node_time() (pacemaker.c
>> near 293) every second.
> 
> So not blocked, it just keeps repeating that message?
> What date does it print?

Blocked... kern_stop

It printed the same date each time, a fairly recent one (in that case).
I caught it only once and cannot reproduce it yet. The date is printed
correctly under "normal" fencing circumstances.

> 
> Did you change it to the following?
>   log_debug("Node %d was last shot at: %s", nodeid, ctime(*last_fenced_time));

http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
contains the patches against 3.0.17, which is what I use. I only backported
commits to the dlm_controld core from 3.1.1 (and, in the last few days, from
3.1.7) to bring it up to date (they are minor).

man ctime
char *ctime(const time_t *timep);

int fence_node_time(int nodeid, uint64_t *last_fenced_time)
is called from check_fencing_done() with
uint64_t last_fenced_time;
rv = fence_node_time(node->nodeid, &last_fenced_time);
so I changed it to ctime(last_fenced_time). By the way, ctime() appends a
trailing newline, so it is a poor fit for log messages.
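One way to log the timestamp without the trailing newline, as a minimal
sketch only: copy the uint64_t into a real time_t (so no uint64_t pointer is
passed where ctime() expects a const time_t *) and format it with
localtime_r() and strftime(). The helper name format_fence_time() and the
date format are my assumptions; nothing like it exists in dlm_controld.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper (not part of dlm_controld): format a fence
 * timestamp for logging. Unlike ctime(), the result has no trailing
 * newline, and the caller supplies the buffer, so no static storage
 * is shared between calls.
 */
const char *format_fence_time(uint64_t last_fenced_time, char *buf, size_t len)
{
    /* Convert by value; never cast a uint64_t * to a time_t *. */
    time_t t = (time_t)last_fenced_time;
    struct tm tm;

    if (len == 0)
        return "";

    /* Fall back to the raw number if the time cannot be formatted. */
    if (localtime_r(&t, &tm) == NULL ||
        strftime(buf, len, "%Y-%m-%d %H:%M:%S", &tm) == 0)
        snprintf(buf, len, "%llu", (unsigned long long)last_fenced_time);

    return buf;
}
```

With a local "char buf[32];" the call in fence_node_time() could then look
like log_debug("Node %d was last shot at: %s", nodeid,
format_fence_time(*last_fenced_time, buf, sizeof(buf))), assuming the
pointer argument shown above.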

One thought: maybe the latest commits to dlm.git (the ones adding membership
monitoring, notably e529211682418a8e33feafc9f703cff87e23aeba) would help here?

One more note: I use fence_xvm for that failed VM, and I found it a little
deficient. Only one instance can run on a host at a time, because it binds
to a predefined TCP port. Maybe that has an influence as well...
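To illustrate the general effect (this is not fence_xvm code, just a sketch
of plain socket behaviour): when one process already holds a TCP port, a
second bind() to the same address and port fails with EADDRINUSE. The helper
name bind_loopback_tcp() is made up for the example, and it lets the kernel
pick a port rather than assuming which one fence_xvm uses.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: bind a TCP socket to 127.0.0.1:port.
 * port == 0 lets the kernel choose a free port. On success the socket
 * fd is returned and, if bound_port is non-NULL, the actual port is
 * stored there. On failure -1 is returned and errno is preserved.
 */
int bind_loopback_tcp(uint16_t port, uint16_t *bound_port)
{
    struct sockaddr_in addr = {0};
    socklen_t len = sizeof(addr);
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        int saved = errno;  /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }

    if (bound_port)
        *bound_port = ntohs(addr.sin_port);
    return fd;
}
```

Calling the helper once with port 0 and then a second time with the port it
reported makes the second call return -1 with errno set to EADDRINUSE for as
long as the first socket stays open, which is the situation two concurrent
fencing agents on a fixed port would be in.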

> 
>> No other messages appear, although
>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>> 444). So both (last_fenced_time >= node->fail_time) and
>> (!node->fence_queries || node->fence_time != last_fenced_time) are
>> false, otherwise one of the messages for those cases would be shown. Then
>> fence_node_time() seems to return 0 from
>> if (wait_count)
>>        return 0;
>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>> false), so it never reaches the check_fencing_done() call and never returns
>> the expected 1.
>> The offending node was actually fenced, but dlm_controld did not
>> handle that.
>>
>> May I ask you to help me a bit with all that logic (as you have already
>> dived into the dlm_controld sources again)? I seem to be so near to
>> success... :|
>>
>> btw, I can't find which source your dlm repo was forked from; maybe you
>> remember?
> 
> iirc, it was dlm.git on fedorahosted.

Yep, I found that already, the pacemaker branch. It seems to be a little bit
outdated compared to 3.0.17, btw.

> 
>>
>> Best,
>> Vladislav
>>
>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>> Hi Andrew,
>>>
>>>>> All the more reason to start using the stonith api directly.
>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>    https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>
>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>> my build. Then it doesn't compile without attached patch.
>>>> It may need to be rebased a bit against your tree.
>>>>
>>>> Now I have package built and am building node images. Will try shortly.
>>>
>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>> patch against that _no_mainloop function (expected).
>>>
>>> So I did my best to build packages from the current git tree.
>>>
>>> Voila! I got the failed node correctly fenced!
>>> I'll do some more extensive testing over the next few days, but I
>>> believe everything should be much better now.
>>>
>>> I knew you're a genius, he-he ;)
>>>
>>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>>> with pacemaker using the openais stack:
>>>
>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>> 3. Build dlm_controld.pcmk
>>>
>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>
>>> Best regards,
>>> Vladislav




