[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Andrew Beekhof andrew at beekhof.net
Thu Nov 24 00:49:57 EST 2011


On Thu, Nov 24, 2011 at 3:58 PM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> 24.11.2011 07:33, Andrew Beekhof wrote:
>> On Tue, Nov 15, 2011 at 7:36 AM, Vladislav Bogdanov
>> <bubble at hoster-ok.com> wrote:
>>> Hi Andrew,
>>>
>>> I just found another problem with dlm_controld.pcmk (with your latest
>>> patch from github applied and also my fixes to actually build it - they
>>> are included in a message referenced by this one).
>>> A node which has just requested fencing of another one gets stuck
>>> printing, every second, the message where you print ctime() in
>>> fence_node_time() (pacemaker.c near line 293).
>>
>> So not blocked, it just keeps repeating that message?
>> What date does it print?
>
> Blocked... kern_stop

I'm confused.
How can it do that every second?

>
> It keeps printing the same, fairly recent, date (in that case).
> I caught it only once and cannot reproduce it yet. The date is printed
> correctly under "normal" fencing circumstances.
>
>>
>> Did you change it to the following?
>>   log_debug("Node %d was last shot at: %s", nodeid, ctime(*last_fenced_time));
>
> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg09959.html
> contains patches against 3.0.17, which I use. I only backported commits
> to the dlm_controld core from 3.1.1 (and, in the last few days, 3.1.7)
> to bring it up to date (they are minor).

Ok, this (which was from my original patch) is wrong:

+        log_debug("Node %d/%s was last shot at: %s", nodeid,
ctime(*last_fenced_time));

The format string has 3 conversions (%d, %s, %s) but only 2 arguments
are supplied, so the last %s reads garbage.
This could easily result in what you're seeing.
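
As a rough sketch of the fix (not the actual dlm_controld code): every
conversion in the format string needs a matching argument. Below,
demo_log_debug() is a hypothetical stand-in for log_debug(), and the
uname parameter is an assumption about what the middle %s was meant to
print. The format attribute is a GCC/Clang extension that lets -Wformat
flag a mismatched call at compile time.

```c
#include <stdarg.h>
#include <stdio.h>

/* Buffer holding the most recent message, so the result can be inspected. */
static char last_msg[256];

/* Hypothetical stand-in for log_debug(). The format attribute (a GCC/Clang
 * extension) asks the compiler to check arguments against the format string. */
__attribute__((format(printf, 1, 2)))
static int demo_log_debug(const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(last_msg, sizeof(last_msg), fmt, ap);
    va_end(ap);
    return n;
}

/* Three conversions (%d, %s, %s), three arguments.
 * "uname" is an assumed node name; "when" is an already formatted time. */
static int log_last_shot(int nodeid, const char *uname, const char *when)
{
    return demo_log_debug("Node %d/%s was last shot at: %s",
                          nodeid, uname, when);
}
```

With the attribute in place, dropping one of the three arguments from
the call should produce a -Wformat warning instead of a runtime hang.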


>
> man ctime
> char *ctime(const time_t *timep);
>
> int fence_node_time(int nodeid, uint64_t *last_fenced_time)
> is called from check_fencing_done() with
> uint64_t last_fenced_time;
> rv = fence_node_time(node->nodeid, &last_fenced_time);
> so I changed it to ctime(last_fenced_time). Btw, ctime() adds a trailing
> newline, so it is a poor fit for logs.
>
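
A possible way around both points, sketched under assumptions: copy the
uint64_t into a real time_t before taking its address (the two types
are not guaranteed to have the same width), and format with strftime()
so that no newline is produced. format_fence_time() is a hypothetical
helper, not something in the dlm tree, and the timestamp layout is just
one choice.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: format a 64-bit timestamp for a log line without
 * the trailing '\n' that ctime() appends. Returns buf. */
static const char *format_fence_time(uint64_t stamp, char *buf, size_t len)
{
    time_t t = (time_t)stamp;   /* copy: time_t need not be 64 bits wide */
    struct tm *tm;

    if (len == 0)
        return buf;

    /* localtime() is standard C but not reentrant; POSIX localtime_r()
     * is the usual choice in threaded code. */
    tm = localtime(&t);
    if (tm == NULL || strftime(buf, len, "%Y-%m-%d %H:%M:%S", tm) == 0) {
        /* Fall back to the raw number if the time cannot be formatted. */
        snprintf(buf, len, "%llu", (unsigned long long)stamp);
    }
    return buf;
}
```

The result can then be passed as the %s argument of the log call in
place of ctime().
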
> One thought: maybe the latest commits to dlm.git (with membership
> monitoring, notably e529211682418a8e33feafc9f703cff87e23aeba) would help
> here?
>
> And one note: I use fence_xvm for that failed VM, and I found that it
> is somewhat deficient. Only one instance of it can run on a host at a
> time, because it binds to a predefined TCP port. Maybe that has an
> influence as well...
>
>>
>>> No other messages appear, although
>>> fence_node_time() is called only from check_fencing_done() (cpg.c near
>>> line 444). So both (last_fenced_time >= node->fail_time) and
>>> (!node->fence_queries || node->fence_time != last_fenced_time) must be
>>> false, otherwise the message for one of those cases would be shown. Then
>>> fence_node_time() seems to return 0 from
>>> if (wait_count)
>>>        return 0;
>>> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
>>> false), so it never reaches check_fencing_done() call and never returns
>>> the expected 1.
>>> The offending node was actually fenced, but dlm_controld did not handle
>>> that.
>>>
>>> May I ask you to help me a bit with all that logic (as you have already
>>> dived into the dlm_controld sources again)? I seem to be so near
>>> success... :|
>>>
>>> btw, I can't find which source your dlm repo was forked from, maybe you
>>> remember?
>>
>> iirc, it was dlm.git on fedorahosted.
>
> Yep, I found that already, pacemaker branch. It seems to be a little bit
> outdated compared to 3.0.17 btw.
>
>>
>>>
>>> Best,
>>> Vladislav
>>>
>>> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>>>> Hi Andrew,
>>>>
>>>>>> All the more reason to start using the stonith api directly.
>>>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>>>    https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>>>
>>>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>>>> my build. Then it doesn't compile without attached patch.
>>>>> It may need to be rebased a bit against your tree.
>>>>>
>>>>> Now I have the package built and am building node images. Will try shortly.
>>>>
>>>> Fencing from within dlm_controld.pcmk still did not work with your first
>>>> patch against that _no_mainloop function (expected).
>>>>
>>>> So I did my best to build packages from the current git tree.
>>>>
>>>> Voila! I got the failed node correctly fenced!
>>>> I'll do some more extensive testing over the next days, but I believe
>>>> everything should be much better now.
>>>>
>>>> I knew you're a genius he-he ;)
>>>>
>>>> So, here are the steps to get DLM to handle CPG NODEDOWN events correctly
>>>> with pacemaker using the openais stack:
>>>>
>>>> 1. Build pacemaker (as of 2011-09-28) from git.
>>>> 2. Apply attached patches to cluster-3.0.17 source tree.
>>>> 3. Build dlm_controld.pcmk
>>>>
>>>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>>>
>>>> Best regards,
>>>> Vladislav
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>
>>>> Project Home: http://www.clusterlabs.org
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>>
>>>
>
>



