[Pacemaker] failed over filesystem mount points not coming up on secondary node

Lonni J Friedman netllama at gmail.com
Thu Sep 27 18:10:16 EDT 2012


Greetings,
I've just started playing with pacemaker/corosync on a two-node setup.
At this point I'm just experimenting, and trying to get a good feel
for how things work.  Eventually I'd like to start using this in a
production environment.  I'm running Fedora16-x86_64 with
pacemaker-1.1.7 & corosync-1.4.3.  I have DRBD set up and working fine
with two resources.  I've verified that pacemaker is doing the right
thing when initially configured (rough verification commands are below
the list).  Specifically:
* the floating static IP is brought up
* DRBD is brought up correctly with a master & slave
* the local DRBD backed mount points are mounted correctly
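
(For reference, I checked each of those roughly like this, assuming the
stock DRBD 8.x and pacemaker CLI tools on F16:)

#########
# floating IP is up on the active node
ip addr show eth1 | grep 10.31.97.100

# DRBD resources are Connected with one Primary and one Secondary
cat /proc/drbd

# both DRBD-backed filesystems are mounted
mount | grep -E '/mnt/sdb[12]'

# overall cluster view
crm_mon -1
#########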

Here's the configuration:
#########
node farm-ljf0 \
	attributes standby="off"
node farm-ljf1
primitive ClusterIP ocf:heartbeat:IPaddr2 \
	params ip="10.31.97.100" cidr_netmask="22" nic="eth1" \
	op monitor interval="10s"
primitive FS0 ocf:linbit:drbd \
	params drbd_resource="r0" \
	op monitor interval="10" role="Master" \
	op monitor interval="30" role="Slave"
primitive FS0_drbd ocf:heartbeat:Filesystem \
	params device="/dev/drbd0" directory="/mnt/sdb1" fstype="xfs"
primitive FS1 ocf:linbit:drbd \
	params drbd_resource="r1" \
	op monitor interval="10s" role="Master" \
	op monitor interval="30s" role="Slave"
primitive FS1_drbd ocf:heartbeat:Filesystem \
	params device="/dev/drbd1" directory="/mnt/sdb2" fstype="xfs"
ms FS0_Clone FS0 \
	meta master-max="1" master-node-max="1" clone-max="2" \
	clone-node-max="1" notify="true"
ms FS1_Clone FS1 \
	meta master-max="1" master-node-max="1" clone-max="2" \
	clone-node-max="1" notify="true"
location cli-prefer-ClusterIP ClusterIP \
	rule $id="cli-prefer-rule-ClusterIP" inf: #uname eq farm-ljf1
colocation fs0_on_drbd inf: FS0_drbd FS0_Clone:Master
colocation fs1_on_drbd inf: FS1_drbd FS1_Clone:Master
order FS0_drbd-after-FS0 inf: FS0_Clone:promote FS0_drbd
order FS1_drbd-after-FS1 inf: FS1_Clone:promote FS1_drbd
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore"
#########
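
(A side note on the config above: I'm not sure I'm reading the scores
correctly, but here's roughly how I've been inspecting what the policy
engine wants to do with placement; crm_verify and crm_simulate are just
my guesses at the right tools for this:)

#########
# sanity-check the live configuration
crm_verify -L -V

# show the policy engine's allocation scores against the live CIB
crm_simulate -sL
#########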

However, when I attempted to simulate a failover situation (I shut down
the current master/primary node completely), not everything failed
over correctly.  Specifically, the mount points did not get mounted,
even though the other two elements did fail over correctly.
'farm-ljf1' is the node that I shut down; farm-ljf0 is the node that I
expected to inherit all of the resources.  Here's the status:
#########
[root at farm-ljf0 ~]# crm status
============
Last updated: Thu Sep 27 15:00:19 2012
Last change: Thu Sep 27 13:59:42 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: farm-ljf0 - partition WITHOUT quorum
Version: 1.1.7-2.fc16-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
7 Resources configured.
============

Online: [ farm-ljf0 ]
OFFLINE: [ farm-ljf1 ]

 ClusterIP	(ocf::heartbeat:IPaddr2):	Started farm-ljf0
 Master/Slave Set: FS0_Clone [FS0]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS0:0 ]
 Master/Slave Set: FS1_Clone [FS1]
     Masters: [ farm-ljf0 ]
     Stopped: [ FS1:0 ]

Failed actions:
    FS1_drbd_start_0 (node=farm-ljf0, call=23, rc=1, status=complete): unknown error
    FS0_drbd_start_0 (node=farm-ljf0, call=24, rc=1, status=complete): unknown error
#########
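
(Here's roughly how I've been poking at the failed Filesystem starts on
farm-ljf0; I'm guessing at the right log file, so corrections welcome:)

#########
# confirm the DRBD devices really did get promoted on farm-ljf0
cat /proc/drbd
drbdadm role r0
drbdadm role r1

# look for the Filesystem agent's error output around the failed starts
# (or /var/log/cluster/corosync.log, depending on how logging is set up)
grep -iE 'FS0_drbd|FS1_drbd|Filesystem' /var/log/messages | tail -50

# once the underlying problem is fixed, clear the failed actions
crm resource cleanup FS0_drbd
crm resource cleanup FS1_drbd
#########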

I eventually brought the shut-down node (farm-ljf1) back up, hoping
that might at least bring things back into a good state, but it's not
working either, and it's showing up as OFFLINE:
##########
[root at farm-ljf1 ~]# crm status
============
Last updated: Thu Sep 27 15:06:54 2012
Last change: Thu Sep 27 14:49:06 2012 via cibadmin on farm-ljf1
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
7 Resources configured.
============

OFFLINE: [ farm-ljf0 farm-ljf1 ]
##########
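
(For what it's worth, this is the sort of thing I've been checking on
farm-ljf1 to get it to rejoin, assuming the problem is at the corosync
membership layer; the service/unit names below are my guess for F16:)

#########
# is corosync seeing the other node at all?
corosync-cfgtool -s
corosync-objctl | grep -i member

# make sure the stack is actually running, and kick it if not
# (assuming pacemakerd runs as its own service; if pacemaker is loaded
#  as a corosync plugin, restarting corosync alone should cover it)
systemctl status corosync.service pacemaker.service
systemctl restart corosync.service pacemaker.service

# then watch for the node to rejoin
crm_mon -1
#########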


So at this point, I've got two problems:
0) FS mount failover isn't working.  I'm hoping this is some silly
configuration issue that can be easily resolved.
1) bringing the "failed" farm-ljf1 node back online doesn't seem to
work automatically, and I can't figure out what kind of magic is
needed.


If this stuff is documented somewhere, I'll gladly read it if someone
can point me in the right direction.

thanks!



