[Pacemaker] node1 fencing itself after node2 being fenced

Asgaroth lists at blueface.com
Fri Feb 7 06:22:24 EST 2014


On 06/02/2014 05:52, Vladislav Bogdanov wrote:
> Hi,
>
> I bet your problem comes from the LSB clvmd init script.
> Here is what it does:
>
> ===========
> ...
> clustered_vgs() {
>      ${lvm_vgdisplay} 2>/dev/null | \
>          awk 'BEGIN {RS="VG Name"} {if (/Clustered/) print $1;}'
> }
>
> clustered_active_lvs() {
>      for i in $(clustered_vgs); do
>          ${lvm_lvdisplay} $i 2>/dev/null | \
>          awk 'BEGIN {RS="LV Name"} {if (/[^N^O^T] available/) print $1;}'
>      done
> }
>
> rh_status() {
>      status $DAEMON
> }
> ...
> case "$1" in
> ...
>    status)
>      rh_status
>      rtrn=$?
>      if [ $rtrn = 0 ]; then
>          cvgs="$(clustered_vgs)"
>          echo Clustered Volume Groups: ${cvgs:-"(none)"}
>          clvs="$(clustered_active_lvs)"
>          echo Active clustered Logical Volumes: ${clvs:-"(none)"}
>      fi
> ...
> esac
>
> exit $rtrn
> =========
>
> So it not only checks the status of the daemon itself, but also tries
> to list volume groups. That operation blocks because fencing is still
> in progress, and the whole cLVM stack (as well as DLM itself and all
> other dependent services) is frozen. As a result your resource's
> monitor operation times out, and pacemaker then asks it to stop
> (unless you have on-fail=fence). Either way, there is a good chance
> that the stop will fail too, and that leads to fencing again. cLVM is
> very fragile in my opinion (although newer versions running on the
> corosync2 stack seem to be much better). It probably still doesn't
> work well when managed by pacemaker in CMAN-based clusters, because it
> blocks globally if any node in the whole cluster is online at the cman
> layer but is not running clvmd (I last checked with .99). That
> behaviour was the same for all stacks until it was fixed recently for
> the corosync (only 2?) stack. The consequence is that you cannot just
> stop pacemaker on one node (e.g. for maintenance); you must
> immediately stop cman as well (or run clvmd the cman way), otherwise
> cLVM freezes on the other node. This should be easy to fix in the
> clvmd code, but nobody cares.
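
To make the failure chain above concrete: pacemaker's monitor for an
LSB resource is simply the init script's "status" action, so while a
fence is pending both the status call and any bare LVM command hang on
the DLM. A minimal way to observe this by hand (a sketch, assuming the
stock clvmd init script and the coreutils timeout(1) utility):

===========
# Both commands block while DLM waits for fencing to complete, so
# bound them with timeout(1); an exit code of 124 means the command
# had to be killed after the time limit, i.e. it was hanging.
timeout 30 /etc/init.d/clvmd status; echo "status rc=$?"
timeout 30 vgdisplay >/dev/null 2>&1; echo "vgdisplay rc=$?"
===========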

Thanks for the explanation, this is interesting to me as I need a
volume manager in the cluster to manage the shared file systems in case
I need to resize for some reason. I think I may be running into
something similar now that I am testing cman outside of the cluster.
Even though I have cman/clvmd enabled outside pacemaker, the clvmd
daemon still hangs even after the 2nd node has been rebooted by a fence
operation. When it (node 2) reboots, cman & clvmd start and I can see
both nodes as members using cman_tool, but clvmd still seems to have an
issue: it just hangs. I can't see off-hand whether dlm still thinks
pacemaker is in the fence operation (or whether it has already returned
true for a successful fence). I am still gathering logs and will post
back to this thread once I have all my logs from yesterday and this
morning.
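
For the next round of debugging, roughly what I will check (a sketch,
assuming the RHEL6/CentOS 6 cman toolset; clvmd's DLM lockspace is
named "clvmd"):

===========
cman_tool nodes   # membership as cman sees it
fence_tool ls     # the "wait state" line shows if a fence is pending
dlm_tool ls       # per-lockspace state, including the clvmd lockspace
===========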

I don't suppose anyone knows of another cluster-aware volume manager
that is available?

>
> Increasing the timeout for the LSB clvmd resource probably won't help
> you, because LVM operations blocked by DLM's wait for fencing IIRC
> never finish.
>
> You may want to search for the clvmd OCF resource agent; it is
> available for SUSE, I think. Although it is not perfect, it should
> work much better for you.

I will have a look around for this clvmd OCF agent and see what is
involved in getting it to work on CentOS 6.5, if I don't have any
success with the current recommendation of running it outside of
pacemaker's control.
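
For reference, the SUSE-style configuration usually runs that agent
cloned alongside a DLM controld resource. A rough crmsh sketch
(assuming the ocf:pacemaker:controld and ocf:lvm2:clvmd agents are
installed; the group/clone names below are just placeholders):

===========
# Hypothetical crmsh configuration; agent availability on CentOS 6.5
# is not guaranteed and needs to be verified first.
primitive dlm ocf:pacemaker:controld op monitor interval=60s timeout=60s
primitive clvm ocf:lvm2:clvmd op monitor interval=60s timeout=90s
group base-group dlm clvm
clone base-clone base-group meta interleave=true
===========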
