[Pacemaker] no failover if fencing device is unreachable (i.e. power loss)

Mon Aug 18 19:50:29 CEST 2014

Hi,

I'm building a two-node cluster running XenServer, Pacemaker and DRBD. There's a problem when I test failover by powering off the currently active node.
When using the fence_xenapi agent, the ClusterIP resource is not moved to the second node until the first node has been shut down successfully.
However, because the XenAPI is unreachable when the machine is powered off, the second node keeps trying to shut down the first node and the resource is never moved.

To check whether the problem lies in the fence_xenapi agent, I tried fence_ipmilan, which works fine as long as the IPMI interface is reachable. When I pull the power cords from the machine,
however, the behavior is the same as with the fence_xenapi agent.
Am I missing an option which should be set? A timeout or a retry counter?
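
To reproduce this without cutting power each time, the fencing request can presumably be triggered by hand from the surviving node. A minimal sketch, assuming pcs and stonith_admin are installed there; the run helper, the DRY_RUN switch and the fence_by_hand function are made up for illustration, and by default the script only prints the commands:

```shell
#!/bin/sh
# Sketch: issue the fencing request manually instead of pulling the power.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three steps for one node.
fence_by_hand() {
    node="$1"
    # List the devices that claim they can fence this node.
    run stonith_admin --list "$node"
    # Ask the cluster to power the node off (matches stonith-action=off).
    run pcs stonith fence "$node" --off
    # Show the last fencing operation recorded for this node.
    run stonith_admin --history "$node"
}
```

Calling `fence_by_hand ftp-test01` on ftp-test02 with DRY_RUN=0 should show whether the off request itself fails or times out while the fencing device is unreachable.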

Here's how I set up the cluster (fence_xenapi) using pcs:

pcs cluster cib ftp_ha_cluster
pcs -f ftp_ha_cluster resource create ClusterIP IPaddr2 ip=172.20.150.150 cidr_netmask=32 op monitor interval=20s
pcs -f ftp_ha_cluster constraint location ClusterIP prefers ftp-test01=50
pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp1 fence_xenapi pcmk_host_list="ftp-test01" action="off" session_url="https://test-xen-01" port="ftp-test01" login="root" passwd="****" delay=15 op monitor interval=40s
pcs -f ftp_ha_cluster stonith create xenvm-fence-ftp2 fence_xenapi pcmk_host_list="ftp-test02" action="off" session_url="https://test-xen-02" port="ftp-test02" login="root" passwd="****" delay=15 op monitor interval=40s
pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp1 prefers ftp-test01=-INFINITY
pcs -f ftp_ha_cluster constraint location xenvm-fence-ftp2 prefers ftp-test02=-INFINITY
pcs -f ftp_ha_cluster property set stonith-enabled=true
pcs -f ftp_ha_cluster property set stonith-action=off
pcs -f ftp_ha_cluster property set stonith-timeout=40s
pcs -f ftp_ha_cluster property set no-quorum-policy=ignore
pcs -f ftp_ha_cluster resource create Ping ocf:pacemaker:ping dampen="5s" multiplier="100" host_list="172.20.150.1 172.20.150.151 172.20.150.152" attempts="3" op monitor interval=20s
pcs -f ftp_ha_cluster resource clone Ping
pcs -f ftp_ha_cluster constraint location ClusterIP rule score=-INF not_defined pingd or pingd lte 0
pcs -f ftp_ha_cluster constraint location ClusterIP rule score=pingd defined pingd
pcs cluster cib-push ftp_ha_cluster

For testing with fence_ipmilan, I replaced the corresponding lines with the following:

pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-01 fence_ipmilan pcmk_host_list="ftp-test01" action="off" ipaddr="test-xen-01-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
pcs -f ftp_ha_cluster stonith create ipmi-fence-test-xen-02 fence_ipmilan pcmk_host_list="ftp-test02" action="off" ipaddr="test-xen-02-bmc.mercateo.lan" auth="password" login="admin" passwd="****" delay=15 op monitor interval=40s
pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-01 prefers ftp-test01=-INFINITY
pcs -f ftp_ha_cluster constraint location ipmi-fence-test-xen-02 prefers ftp-test02=-INFINITY
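
The IPMI case can probably be checked outside Pacemaker by asking the BMC for the power status directly. With the power cords pulled, the BMC presumably has no power either, so I would expect even this plain status query to fail. A sketch under those assumptions; IPMI_PASSWORD is a hypothetical environment variable standing in for the real password, and run, DRY_RUN and bmc_status are illustrative names:

```shell
#!/bin/sh
# Sketch: query a BMC for the power status with the fence agent itself.
# DRY_RUN=1 (the default) only prints the command; set DRY_RUN=0 to run it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Same login and auth type as in the stonith resources above;
# the password is read from IPMI_PASSWORD (an assumption, not part of my setup).
bmc_status() {
    bmc="$1"
    run fence_ipmilan -a "$bmc" -l admin -p "$IPMI_PASSWORD" -A password -o status
}
```

For example, `bmc_status test-xen-01-bmc.mercateo.lan` with DRY_RUN=0, once with the machine powered and once with the cords pulled, should show whether the agent can report anything at all in the second case.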

The content of /etc/corosync/corosync.conf:

compatibility: whitetank

totem {
	version: 2
	secauth: off
	threads: 0
	interface {
		ringnumber: 0
		bindnetaddr: 192.168.199.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
		ttl: 1
	}
}

logging {
	fileline: off
	to_stderr: no
	to_logfile: yes
	to_syslog: no
	logfile: /var/log/cluster/corosync.log
	debug: off
	timestamp: on
	logger_subsys {
		subsys: AMF
		debug: off
	}
}

amf {
	mode: disabled
}

service {
	ver:	1
	name:	pacemaker
}

Any idea what could be missing/wrong?

Kind regards,

Felix