[Pacemaker] node1 fencing itself after node2 being fenced

Fri Feb 7 07:07:16 EST 2014

07.02.2014 14:22, Asgaroth wrote:
...
> 
> Thanks for the explanation, this is interesting for me as I need a
> volume manager in the cluster to manage the shared file systems in case
> I need to resize for some reason. I think I may be coming up against
> something similar now that I am testing cman outside of the cluster.
> Even though I have cman/clvmd enabled outside pacemaker, the clvmd
> daemon still hangs even when the 2nd node has been rebooted due to a
> fence operation. When it (node 2) reboots, cman & clvmd start, and I
> can see both nodes as members using cman_tool, but clvmd still seems to
> have an issue: it just hangs. I can't see offhand if dlm still thinks
> pacemaker is in the fence operation (or if it has already returned true
> for successful fence). I am still gathering logs and will post back to
> this thread once I have all my logs from yesterday and this morning.

As I wrote (maybe it was not completely clear), there are two points
where clustered LVM may block: dlm (kern_stop flag in 'dlm_tool ls'
output) and clvmd itself (not all cluster nodes run clvmd). Of course
there could be additional bugs.

I'd break fencing for your node1 and look at what dlm_tool shows there
after node2 is fenced. 'dlm_tool ls' and 'dlm_tool dump' should provide
enough information (but you'd probably need to dig into the dlm_controld
code to fully interpret the latter). Also, you may want to run clvmd in
debugging mode.
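
If you'd rather not eyeball the 'dlm_tool ls' output each time, a small
script can pick out the lockspaces that carry the kern_stop flag. This
is only a rough sketch: the sample text below is my assumption of what
the output looks like (a "name" line followed by a "flags" line per
lockspace), and the function names are made up for illustration. Check
it against what your dlm_tool version really prints.

```python
import re

# Assumed shape of 'dlm_tool ls' output; illustrative only, not captured
# from a real cluster. Verify against your own dlm_tool version.
SAMPLE_OUTPUT = """\
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2

name          other
id            0x12345678
flags         0x00000000
change        member 2 joined 1 remove 0 failed 0 seq 1,1
members       1 2
"""


def parse_lockspaces(text):
    """Return a list of {'name': ..., 'flags': [...]} dicts, one per lockspace."""
    lockspaces = []
    current = None
    for line in text.splitlines():
        tokens = re.split(r"[,\s]+", line.strip())
        if not tokens[0]:
            continue  # blank line between lockspaces
        if tokens[0] == "name" and len(tokens) >= 2:
            current = {"name": tokens[1], "flags": []}
            lockspaces.append(current)
        elif tokens[0] == "flags" and current is not None:
            # tokens[1] is the hex mask; the rest are symbolic flag names
            current["flags"] = tokens[2:]
    return lockspaces


def blocked_lockspaces(text):
    """Names of lockspaces whose flags include kern_stop."""
    return [ls["name"] for ls in parse_lockspaces(text)
            if "kern_stop" in ls["flags"]]


print(blocked_lockspaces(SAMPLE_OUTPUT))
```

To run it against a live node, feed it the real output instead of the
sample, e.g. the stdout of subprocess.run(["dlm_tool", "ls"],
capture_output=True, text=True). A lockspace that stays in this list
long after node2 has rejoined is the dlm-side block I mean above.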

> 
> I don't suppose anyone knows of another cluster-aware volume manager
> that is available?

I'm not aware of any.

> 
>>
>> Increasing the timeout for the LSB clvmd resource probably won't help
>> you, because LVM operations that are blocked (because DLM is waiting
>> for fencing) never finish, iirc.
>>
>> You may want to search for the clvmd OCF resource agent; it is
>> available for SUSE, I think. Although it is not perfect, it should
>> work much better for you.
> 
> I will have a look around for this clvmd OCF agent and see what is
> involved in getting it to work on CentOS 6.5 if I don't have any
> success with the current recommendation for running it outside of
> pacemaker control.

Generally, that alone won't help, because you'll still get timeouts on
every LVM operation if some of the cman nodes do not run clvmd for any
reason. I mean, if you manage VGs/LVs as cluster resources. But that
removes one point of failure when combined with a newer stack.

I know that the latest versions of the cluster-stack software (those
which require corosync2 and its quorum implementation) work like a charm
all together, and there was a REASON to write them (and use them in RHEL7).
