[Pacemaker] DRBD/NFS master/slave doesn't fail over to the slave node

Mon Jul 5 10:41:05 UTC 2010

  Hello,

I searched the list and tried lots of things, but nothing works, so I'm 
posting here.
I should mention that my configuration worked with heartbeat2/crm, but 
since I migrated to corosync/pacemaker I have a problem.
Here is my CIB:

node filer1 \
         attributes standby="off"
node filer2 \
         attributes standby="off"
primitive drbd_nfs ocf:linbit:drbd \
         params drbd_resource="r0" \
         op monitor interval="15s" timeout="60"
primitive fs_nfs ocf:heartbeat:Filesystem \
         op monitor interval="120s" timeout="60s" \
         params device="/dev/drbd0" directory="/data" fstype="ext4"
primitive ip_failover heartbeat:OVHfailover.py \
         op monitor interval="120s" timeout="60s" \
         params 1="cgXXXX-ovh" 2="******" 3="*****.ovh.net" 4="ip.ip.ip.ip"
primitive ip_nfs ocf:heartbeat:IPaddr2 \
         op monitor interval="60s" timeout="20s" \
         params ip="192.168.0.20" cidr_netmask="24" nic="vlan2019"
primitive nfs_server lsb:nfs \
         op monitor interval="120s" timeout="60s"
group group_nfs ip_nfs fs_nfs nfs_server ip_failover \
         meta target-role="Started"
ms ms_drbd_nfs drbd_nfs \
         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
colocation nfs_on_drbd inf: group_nfs ms_drbd_nfs:Master
order nfs_after_drbd inf: ms_drbd_nfs:promote group_nfs:start
property $id="cib-bootstrap-options" \
         symmetric-cluster="true" \
         no_quorum-policy="stop" \
         default-resource-stickiness="0" \
         default-resource-failure-stickiness="0" \
         stonith-enabled="false" \
         stonith-action="reboot" \
         stop-orphan-resources="true" \
         stop-orphan-actions="true" \
         remove-after-stop="false" \
         short-resource-names="true" \
         transition-idle-timeout="3min" \
         default-action-timeout="30s" \
         is-managed-default="true" \
         startup-fencing="true" \
         cluster-delay="60s" \
         expected-nodes="1" \
         election_timeout="50s" \
         expected-quorum-votes="2" \
         dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
         cluster-infrastructure="openais"

So I have a DRBD resource set up as master/slave, and a group containing 
OVHfailover (a custom script I wrote to move a failover IP with my 
hosting provider; this one works without problems), Filesystem (to mount 
drbd0), NFS (to start the NFS server) and IPaddr2 (to attach an IP in a 
VLAN).
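The start and stop order implied by the group and by the `nfs_after_drbd` constraint can be sketched as below. This is a hypothetical Python model, not something Pacemaker provides, and the function names are made up for illustration. It only encodes two documented rules: group members start in the order listed on the `group group_nfs` line (not the order in the paragraph above) and stop in reverse, and order constraints are symmetrical by default, so the group stops before DRBD is demoted.

```python
# Hypothetical model of the ordering implied by the CIB above.
# It does not query a cluster; it only applies documented Pacemaker rules.
GROUP_NFS = ["ip_nfs", "fs_nfs", "nfs_server", "ip_failover"]


def start_sequence(group):
    """Actions to bring the service up on the node holding the DRBD master."""
    # order nfs_after_drbd: promote must complete before the group starts
    actions = ["promote ms_drbd_nfs"]
    # group members start one after another, in the order they are listed
    actions += ["start " + r for r in group]
    return actions


def stop_sequence(group):
    """Inverse of start_sequence: symmetrical ordering reverses everything."""
    # group members stop in reverse order
    actions = ["stop " + r for r in reversed(group)]
    # the inverse of "promote, then start group" is "stop group, then demote"
    actions.append("demote ms_drbd_nfs")
    return actions


if __name__ == "__main__":
    for action in start_sequence(GROUP_NFS):
        print(action)
```

Running it prints the start actions from `promote ms_drbd_nfs` down to `start ip_failover`.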

Now I start both nodes:
#crm_mon
============
Last updated: Mon Jul  5 12:24:04 2010
Stack: openais
Current DC: filer1.connecting-nature.com - partition with quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ filer2 filer1 ]

  Resource Group: group_nfs
      ip_nfs     (ocf::heartbeat:IPaddr2):       Started filer1
      fs_nfs     (ocf::heartbeat:Filesystem):    Started filer1
      nfs_server (lsb:nfs):      Started filer1
      ip_failover        (heartbeat:OVHfailover.py):     Started filer1
  Master/Slave Set: ms_drbd_nfs
      Masters: [ filer1 ]
      Slaves: [ filer2 ]

Everything's fine.
Now I stop filer1:
#/etc/init.d/corosync stop

It stops correctly.

But crm_mon shows:

============
Last updated: Mon Jul  5 11:28:59 2010
Stack: openais
Current DC: filer1.connecting-nature.com - partition WITHOUT quorum
Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ filer2 ]
OFFLINE: [ filer1 ]

And nothing happens: the resources don't migrate to filer2, which is 
online. In fact, as pasted above, they don't even appear.
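The "partition WITHOUT quorum" line in that output is the detail worth checking. As a rough sketch, assuming the usual majority rule (a partition is quorate only when it holds strictly more than half of expected-quorum-votes) and the documented meaning of the no-quorum-policy values, the arithmetic for this two-node cluster looks like this. The helper names are hypothetical.

```python
# Sketch assuming a simple majority rule; vote counts come from crm_mon above.
VALID_POLICIES = {"stop", "freeze", "ignore", "suicide"}


def has_quorum(votes, expected_votes):
    """A partition is quorate only with strictly more than half the expected votes."""
    return 2 * votes > expected_votes


def may_recover_resources(quorate, policy="stop"):
    """Can this partition start resources that were running on a lost node?"""
    if policy not in VALID_POLICIES:
        raise ValueError("unknown no-quorum-policy: " + policy)
    # Without quorum only "ignore" lets Pacemaker carry on as usual;
    # "freeze" keeps what is already running but recovers nothing,
    # "stop" stops everything, "suicide" fences the partition's nodes.
    return quorate or policy == "ignore"


if __name__ == "__main__":
    q = has_quorum(votes=1, expected_votes=2)  # filer2 alone
    print(q, may_recover_resources(q, "stop"))
```

For filer2 alone this prints `False False`: one vote out of two expected is not a majority, and under the `stop` policy (which is also Pacemaker's default) the surviving partition does not start the group.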

Now if I restart filer1, the resources migrate to filer2 and start on 
filer2 once filer1 has restarted.

I don't know where my mistake is. I've tried many different configurations, 
but each time nothing happens. In the worst case
it starts an infinite loop where filer1 tries to promote, then stops, then 
filer2 tries to promote, then stops, again and again (one loop takes ~1 sec).

Thanks for your help

Guillaume