[Pacemaker] Filesystem primitive does not start when one of nodes is switched off

Tue Feb 14 05:41:48 UTC 2012

Hi,
I am having trouble with my test configuration.
I built an Active/Active cluster (Ubuntu 11.10 + DRBD + Cman + Pacemaker + gfs2 + Xen) for testing purposes.
I am now running availability tests and am trying to start the cluster on a single node.

The problem is that the Filesystem primitive ClusterFS (fstype=gfs2) does not start when one of the two nodes is switched off.

Here is my configuration:

node blaster \
        attributes standby="off"
node turrel \
        attributes standby="off"
primitive ClusterData ocf:linbit:drbd \
        params drbd_resource="clusterdata" \
        op monitor interval="60s"
primitive ClusterFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/clusterdata" directory="/mnt/cluster" fstype="gfs2" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="60s" \
        op monitor interval="60s" timeout="60s"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.122.252" cidr_netmask="32" clusterip_hash="sourceip" \
        op monitor interval="30s"
primitive SSH-stonith stonith:ssh \
        params hostlist="turrel blaster" \
        op monitor interval="60s"
primitive XenDom ocf:heartbeat:Xen \
        params xmfile="/etc/xen/xen1.example.com.cfg" \
        meta allow-migrate="true" is-managed="true" target-role="Stopped" \
        utilization cores="1" mem="512" \
        op monitor interval="30s" timeout="30s" \
        op start interval="0" timeout="90s" \
        op stop interval="0" timeout="300s"
ms ClusterDataClone ClusterData \
        meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
clone ClusterFSClone ClusterFS \
        meta target-role="Started" is-managed="true"
clone IP ClusterIP \
        meta globally-unique="true" clone-max="2" clone-node-max="2"
clone SSH-stonithClone SSH-stonith
location prefere-blaster XenDom 50: blaster
colocation XenDom-with-ClusterFS inf: XenDom ClusterFSClone
colocation fs_on_drbd inf: ClusterFSClone ClusterDataClone:Master
order ClusterFS-after-ClusterData inf: ClusterDataClone:promote ClusterFSClone:start
order XenDom-after-ClusterFS inf: ClusterFSClone XenDom
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="cman" \
        expected-quorum-votes="2" \
        stonith-enabled="true" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1329194925"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

Here is the output of crm resource show:

Master/Slave Set: ClusterDataClone [ClusterData]
     Masters: [ turrel ]
     Stopped: [ ClusterData:1 ]
 Clone Set: IP [ClusterIP] (unique)
     ClusterIP:0        (ocf::heartbeat:IPaddr2) Started
     ClusterIP:1        (ocf::heartbeat:IPaddr2) Started
 Clone Set: ClusterFSClone [ClusterFS]
     Stopped: [ ClusterFS:0 ClusterFS:1 ]
 Clone Set: SSH-stonithClone [SSH-stonith]
     Started: [ turrel ]
     Stopped: [ SSH-stonith:1 ]
 XenDom (ocf::heartbeat:Xen) Stopped

I tried:
crm(live)resource# cleanup ClusterFSClone
Cleaning up ClusterFS:0 on turrel
Cleaning up ClusterFS:1 on turrel
Waiting for 3 replies from the CRMd... OK

The only warning messages I can see in /var/log/cluster/corosync.log are:
Feb 14 16:25:56 turrel pengine: [1640]: WARN: unpack_rsc_op: Processing failed op ClusterFS:0_start_0 on turrel: unknown exec error (-2)
and
Feb 14 16:25:56 turrel pengine: [1640]: WARN: common_apply_stickiness: Forcing ClusterFSClone away from turrel after 1000000 failures (max=1000000)
Feb 14 16:25:56 turrel pengine: [1640]: WARN: common_apply_stickiness: Forcing ClusterFSClone away from turrel after 1000000 failures (max=1000000)
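
To make the failing operation easier to pick out of a long corosync.log, the warnings can be filtered with a small script. This is only a rough sketch: the regular expression and the name failed_ops are my own assumptions, derived from the two kinds of log line shown above and not from any documented log format, so they may need adjusting for other Pacemaker versions.

```python
import re

# Assumed pattern, modelled on the single "Processing failed op" line quoted above.
# It splits "ClusterFS:0_start_0" into resource, operation and interval.
FAILED_OP = re.compile(
    r"pengine: \[\d+\]: WARN: unpack_rsc_op: Processing failed op "
    r"(?P<resource>\S+?)_(?P<op>[a-z]+)_(?P<interval>\d+) on (?P<node>\S+): "
    r"(?P<reason>.+?) \((?P<rc>-?\d+)\)$"
)


def failed_ops(lines):
    """Return one dict per 'Processing failed op' warning found in lines."""
    found = []
    for line in lines:
        match = FAILED_OP.search(line.strip())
        if match:
            found.append(match.groupdict())
    return found


# The two distinct warnings from the log excerpt above.
sample = [
    "Feb 14 16:25:56 turrel pengine: [1640]: WARN: unpack_rsc_op: "
    "Processing failed op ClusterFS:0_start_0 on turrel: unknown exec error (-2)",
    "Feb 14 16:25:56 turrel pengine: [1640]: WARN: common_apply_stickiness: "
    "Forcing ClusterFSClone away from turrel after 1000000 failures (max=1000000)",
]

ops = failed_ops(sample)
for entry in ops:
    print(entry["resource"], entry["op"], entry["node"], entry["rc"])
```

To run it against the real log, pass an open file object instead of the sample list, for example failed_ops(open("/var/log/cluster/corosync.log")).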

Could you please point me to what else I should check?

Best regards,
Dmitriy Bogomolov