[Pacemaker] OCFS2 problems when connectivity lost

Wed Dec 21 18:33:57 CET 2011

2011/12/21 Ivan Savčić | Epix <ivan.savcic at epix.rs>:
> Hello,
>
>
> We are having a problem with a 3-node cluster based on Pacemaker/Corosync
> with 2 primary DRBD+OCFS2 nodes and a quorum node.
>
> Nodes run on Debian Squeeze, all packages are from the stable branch except
> for Corosync (which is from backports for udpu functionality). Each node has
> a single network card.

I strongly suggest that you also use Pacemaker and resource-agents from
squeeze-backports.

> When the network is up, everything works without any problems; graceful
> shutdown of resources on any node works as intended and doesn't affect
> the remaining cluster partition.
>
> When the network is down on one OCFS2 node, Pacemaker
> (no-quorum-policy="stop") tries to shut the resources down on that node, but
> fails to stop the OCFS2 filesystem resource stating that it is "in use".

Are you sure you have fencing configured correctly? Normally the
remaining nodes should attempt to fence the misbehaving node.
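
One rough way to sanity-check this is to look at the configuration text
printed by "crm configure show" and see whether STONITH is enabled and
whether any STONITH primitives are defined at all. The following is only a
minimal sketch under those assumptions: the function name fencing_summary
and the SAMPLE text are hypothetical and not taken from the poster's
cluster, and the check says nothing about whether the fencing devices
actually work.

```python
import re

# Hypothetical configuration text, shaped like "crm configure show" output.
SAMPLE = '''
primitive p_fs ocf:heartbeat:Filesystem \\
    params device="/dev/drbd0" directory="/mnt" fstype="ocfs2"
property no-quorum-policy="stop" \\
    stonith-enabled="false"
'''

def fencing_summary(crm_config_text):
    """Return (stonith_enabled, stonith_primitive_names) for the given text."""
    # Pacemaker treats stonith-enabled as true unless it is set otherwise.
    enabled = True
    match = re.search(r'stonith-enabled="?(\w+)"?', crm_config_text)
    if match:
        enabled = match.group(1).lower() in ("true", "yes", "on", "1")
    # STONITH resources are primitives whose agent class is "stonith".
    primitives = re.findall(
        r'^primitive\s+(\S+)\s+stonith:', crm_config_text, re.MULTILINE
    )
    return enabled, primitives

if __name__ == "__main__":
    enabled, primitives = fencing_summary(SAMPLE)
    print("stonith enabled:", enabled)
    print("stonith primitives:", primitives)
```

If STONITH is disabled, or enabled with no STONITH primitive defined, the
surviving partition has no means of fencing the node that lost its network.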

> *Both* OCFS2 nodes (i.e. the one with the network down and the one which is
> still up in the partition with quorum) hang with dmesg reporting that
> events, ocfs2rec and ocfs2_wq are "blocked for more than 120 seconds".

That, again, would be an expected side effect if your fencing
malfunctioned: I/O on the device has to freeze until the nodes that
are scheduled for fencing have in fact been fenced. If that fencing
operation never succeeds, then I/O on the remaining nodes freezes
indefinitely.
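
The rule described above can be illustrated with a toy model. This is only
a sketch of the reasoning, not how OCFS2 or its lock manager is
implemented; the names io_allowed and simulate, the node name "node2" and
the fence_results mapping are all hypothetical.

```python
def io_allowed(pending_fence, fenced):
    """I/O may proceed only if every node scheduled for fencing is fenced."""
    return pending_fence <= fenced  # set inclusion

def simulate(pending_fence, fence_results, max_attempts=3):
    """Retry fencing a few times and report whether I/O could resume.

    fence_results maps a node name to True if its fencing operation
    succeeds, False (or missing) if it never does.
    """
    fenced = set()
    for attempt in range(1, max_attempts + 1):
        for node in sorted(pending_fence - fenced):
            if fence_results.get(node, False):
                fenced.add(node)
        if io_allowed(pending_fence, fenced):
            return "I/O resumed after attempt %d" % attempt
    # A real cluster would keep waiting; the model just stops retrying.
    return "I/O still frozen"

if __name__ == "__main__":
    print(simulate({"node2"}, {"node2": True}))
    print(simulate({"node2"}, {"node2": False}))
```

In the second call no fencing attempt ever succeeds, so the model never
leaves the frozen state, which matches the hang seen on the node that still
has quorum.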

> When the network is operational, umount by hand works without any problems,
> because for the testing scenario there are no services running which are
> keeping the mountpoint busy.
>
> Configuration we used is pretty much from "ClusterStack/LucidTesting"
> document [1], with clone-max="2" added where needed because of the
> additional quorum node in comparison to that document.