[Pacemaker] pacemaker/dlm problems

Tue Sep 27 05:59:14 UTC 2011

On Mon, Sep 26, 2011 at 6:41 PM, Vladislav Bogdanov
<bubble at hoster-ok.com> wrote:
> 26.09.2011 11:16, Andrew Beekhof wrote:
> [snip]
>>>
>>>>
>>>> Regardless, for 1.1.6 the dlm would be better off making a call like:
>>>>
>>>>           rc = st->cmds->fence(st, st_opts, target, "reboot", 120);
>>>>
>>>> from fencing/admin.c
>>>>
>>>> That would talk directly to the fencing daemon, bypassing attrd, crmd
>>>> and PE - and thus be more reliable.
>>>>
>>>> This is what the cman plugin will be doing soon too.
>>>
>>> Great to know, I'll try that in the near future. Thank you very much
>>> for the pointer.
>>
>> 1.1.7 will actually make use of this API regardless of any *_controld
>> changes - i'm in the middle of updating the two library functions they
>> use (crm_terminate_member and crm_terminate_member_no_mainloop).
>
> Ah, then I'll try your patch and wait for that to be resolved.
>
>>
>>>
>>>>
>>>>>
>>>>> I agree with Jiaju
>>>>> (https://lists.linux-foundation.org/pipermail/openais/2011-September/016713.html),
>>>>> that this could be solely a pacemaker problem, because I think it
>>>>> should probably originate fencing itself in such a situation.
>>>>>
>>>>> So, using pacemaker/dlm with the openais stack is currently risky
>>>>> due to possible hangs of dlm lockspaces.
>>>>
>>>> It shouldn't be, failing to connect to attrd is very unusual.
>>>
>>> By the way, one of the underlying problems, which actually made me
>>> notice all this, is that a pacemaker cluster does not fence its DC if
>>> it leaves the cluster for a very short time. That is what Jiaju said
>>> in his notes, and I can confirm it.
>>
>> That's highly surprising.  Do the logs you sent display this behaviour?
>
> They do. The rest of the cluster begins the election, but then accepts
> the returned DC back (I am writing this from memory; I looked at the
> logs on Sep 5-6, so I may be mixing something up).

Actually, this might be possible - if DC.old came back before DC.new
had a chance to get elected, run the PE and initiate fencing, then
there would be no need to fence.
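
As a rough illustration of that race, here is a minimal C sketch. It is a
toy timing model, not Pacemaker code: the struct, its fields and the
function name are all hypothetical, and it assumes fencing is initiated
only after the election and the PE run have both completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical timing model; none of these names exist in Pacemaker.
 * All times are in milliseconds, measured from the moment DC.old left. */
struct dc_timeline {
    int old_dc_back_ms; /* when DC.old rejoins the cluster */
    int election_ms;    /* time the remaining nodes need to elect DC.new */
    int pe_run_ms;      /* time DC.new needs to run the PE afterwards */
};

/* Returns true if DC.new would have initiated fencing before DC.old
 * came back, false if DC.old returned first and no fencing is needed. */
bool fencing_initiated(const struct dc_timeline *t)
{
    assert(t != NULL);

    /* Assumption: fencing starts only once election and PE run are done. */
    int fence_start_ms = t->election_ms + t->pe_run_ms;

    return t->old_dc_back_ms >= fence_start_ms;
}
```

With an election of 2000 ms and a PE run of 1000 ms, a DC that returns
after 500 ms is simply accepted back, while one that returns after 5000 ms
would already have had fencing initiated against it.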

> [snip]
>>>>> Although it took 25 seconds instead of 3 to break the cluster (I
>>>>> understand it is almost impossible to load a host that much, but
>>>>> anyway), I then got a real nightmare: two nodes of the 3-node
>>>>> cluster had cman stopped (and pacemaker too, because of the cman
>>>>> connection loss) - they each asked to kick_node_from_cluster() the
>>>>> other, and that succeeded. But fencing didn't happen (I still need
>>>>> to look into why, but this is cman specific).
>
> Btw, the underlying logic of this part is tricky for me to understand:
> * cman just stops the cman processes on remote nodes, disregarding
> quorum. I hope that could be fixed in corosync, if I understand one of
> the latest threads there correctly.
> * But cman does not fence those nodes, and they still run
> resources. This could be extremely dangerous under some
> circumstances. And cman does not fence even if it has fence devices
> configured in cluster.conf (I verified that).
>
>>>>> On the remaining node pacemaker hung: it did not even notice the
>>>>> cluster infrastructure change, the down nodes were listed as
>>>>> online, one of them was the DC, and all resources were marked as
>>>>> started on all nodes (including the down ones). There were no log
>>>>> entries from pacemaker at all.
>>>>
>>>> Well, I can't see any logs from anyone, so it's hard for me to comment.
>>>
>>> Logs are sent privately.
>>>
>>>>
>
> Vladislav