[Pacemaker] [Partially SOLVED] pacemaker/dlm problems

Vladislav Bogdanov bubble at hoster-ok.com
Mon Nov 14 15:58:49 EST 2011


Ah, one (possibly important) addition.

The offending node is a VM hosted on another cluster, which was having
serious problems at that time because of a bug in one of my RAs, and the
cluster host carrying that VM was shut down hard. So it is possible that
the fencing request succeeded right before that host went down. A
significant amount of time passed before the VM was started on another
host. It is also possible that the VM failed again after it started. I
can't say anything more precise right now; I needed to bring everything
back to a working state quickly.
Given the above, should last_fenced_time perhaps be set to zero if a node
has been seen again after it was fenced?
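
To make the wait condition from my earlier message (quoted below) easier
to follow, here is a minimal sketch of it in C. It is not the
dlm_controld code: struct node_sketch, check_fencing_done_sketch() and
the last_fenced array (which stands in for calling fence_node_time() on
each node) are hypothetical, and the bookkeeping in the inner branch is
my assumption about what happens where the real code logs. Only the two
conditions and the wait_count check are taken from the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced per-node record: only the fields named in the thread. */
struct node_sketch {
    uint64_t fail_time;     /* when the node was seen to fail */
    uint64_t fence_time;    /* last fencing timestamp we were told about */
    int      fence_queries; /* how many times a new timestamp was recorded */
    int      check_fencing; /* non-zero while fencing is still awaited */
};

/*
 * Returns 1 when no node is still waiting for fencing, 0 otherwise.
 * last_fenced[i] stands in for the value fence_node_time() would report
 * for nodes[i]; passing it in keeps the sketch self-contained.
 */
int check_fencing_done_sketch(struct node_sketch *nodes, size_t count,
                              const uint64_t *last_fenced)
{
    int wait_count = 0;

    for (size_t i = 0; i < count; i++) {
        struct node_sketch *node = &nodes[i];
        uint64_t last_fenced_time = last_fenced[i];

        if (!node->check_fencing)
            continue;

        if (last_fenced_time >= node->fail_time) {
            /* Fenced at or after the failure: stop waiting for this node. */
            node->check_fencing = 0;
        } else {
            if (!node->fence_queries ||
                node->fence_time != last_fenced_time) {
                /* First query, or the timestamp changed (assumed: the
                 * real code would log a "wait" message here). */
                node->fence_queries++;
                node->fence_time = last_fenced_time;
            }
            /* Otherwise nothing is logged, but we still wait. */
            wait_count++;
        }
    }

    if (wait_count)
        return 0;
    return 1;
}
```

In this sketch, a node with fail_time 100 whose only known fencing
timestamp is 90 makes the first call record the query and return 0, and
every later call return 0 without touching anything, until a timestamp
of 100 or later is reported. That matches the silent once-per-second
loop I am seeing, if the fencing really completed before the failure
time that was recorded.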

14.11.2011 23:36, Vladislav Bogdanov wrote:
> Hi Andrew,
> 
> I just found another problem with dlm_controld.pcmk (with your latest
> patch from github applied and also my fixes to actually build it - they
> are included in a message referenced by this one).
> One node, which had just requested fencing of another one, gets stuck
> printing the message where you print ctime() in fence_node_time()
> (pacemaker.c near line 293) every second. No other messages appear,
> although fence_node_time() is called only from check_fencing_done()
> (cpg.c near line 444). So both (last_fenced_time >= node->fail_time) and
> (!node->fence_queries || node->fence_time != last_fenced_time) are
> false; otherwise the message for one of those cases would be shown. Then
> fence_node_time() seems to return 0 from
> if (wait_count)
> 	return 0;
> (wait_count is incremented if (last_fenced_time >= node->fail_time) is
> false), so it never reaches the check_fencing_done() call and never
> returns the expected 1.
> The offending node was in fact fenced, but dlm_controld did not handle
> that.
> 
> May I ask you to help me a bit with all that logic (since you have
> already dived into the dlm_controld sources again)? I seem to be so
> close to success... :|
> 
> By the way, I can't find which source your dlm repo was forked from;
> maybe you remember?
> 
> Best,
> Vladislav
> 
> 28.09.2011 17:41, Vladislav Bogdanov wrote:
>> Hi Andrew,
>>
>>>> All the more reason to start using the stonith api directly.
>>>> I was playing around last night with the dlm_controld.pcmk code:
>>>>    https://github.com/beekhof/dlm/commit/9f890a36f6844c2a0567aea0a0e29cc47b01b787
>>>
>>> Doesn't seem to apply to 3.0.17, so I rebased that commit against it for
>>> my build. Then it doesn't compile without attached patch.
>>> It may need to be rebased a bit against your tree.
>>>
>>> Now I have package built and am building node images. Will try shortly.
>>
>> Fencing from within dlm_controld.pcmk still did not work with your first
>> patch against that _no_mainloop function (expected).
>>
>> So I did my best to build packages from the current git tree.
>>
>> Voila! I got the failed node correctly fenced!
>> I'll do some more extensive testing over the next few days, but I
>> believe everything should be much better now.
>>
>> I knew you're a genius, he-he ;)
>>
>> So, here are the steps to get DLM to handle CPG NODEDOWN events
>> correctly with pacemaker using the openais stack:
>>
>> 1. Build pacemaker (as of 2011-09-28) from git.
>> 2. Apply attached patches to cluster-3.0.17 source tree.
>> 3. Build dlm_controld.pcmk
>>
>> One note - gfs2_controld probably needs to be fixed too (FIXME).
>>
>> Best regards,
>> Vladislav
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker