[Pacemaker] Problem with ocf.pacemaker.pingd and host unreachable.

Fri May 13 13:56:24 UTC 2011

Hello Pierre,

I see you don't define your IPs in the cluster. Am I right that you want to ensure
that the active node can reach most of your pingd IPs?

I have a simple failover config that works as follows:
- DRBD active/active
- one node (the default gateway) is pinged by 'pingd' from every cluster node
- if the gateway becomes unreachable, the node with the greatest 'pingd' score gets
the cluster IP assigned, along with all resources that depend on it. The filesystem
always stays active/active.

I don't have LVM or SCSI active, though; it's a GFS2 config for legacy cluster
resources such as apache, nagios3 and the like, which cannot replicate themselves.

Currently I cannot imagine a case in which a location constraint for a
master/master resource would do any good.

Besides that, you should configure fencing in your drbd.d/global_common.conf (or
use STONITH), as this is meant to ensure that one node stops writing to its
filesystem. The drbd fencing scripts only make sense if you use two network
connections, which you do.
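For reference, a minimal sketch of what the fencing part of drbd.d/global_common.conf can look like with DRBD 8.3/8.4 under Pacemaker. It assumes the crm-fence-peer.sh and crm-unfence-peer.sh handler scripts shipped with DRBD are installed under /usr/lib/drbd/ (the path may differ per distribution). `resource-only` is assumed because the config below runs with stonith-enabled="false"; for dual-primary GFS2 the DRBD documentation recommends `resource-and-stonith` together with a working STONITH setup.

```
common {
  disk {
    # resource-only: when the peer is lost, call the fence-peer handler
    # resource-and-stonith would additionally freeze I/O until fencing succeeds
    fencing resource-only;
  }
  handlers {
    # assumed install paths; check where your distribution puts these scripts
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }
}
```

The fence-peer handler adds a constraint to the CIB that keeps the outdated peer from being promoted; the after-resync-target handler removes it again once that peer has resynced.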

Here's my example config. One network (192.168.56.0/24) is for external
communication, the other network (10.0.0.0/30) is a simple crossover connection
for drbd traffic. I removed irrelevant parts:
---
node clusternode1
node clusternode2
primitive resDLM ocf:pacemaker:controld \
	op monitor interval="30s" timeout="20s" start-delay="0s" \
	op start timeout="90s" op stop timeout="100s"
primitive resDRBD ocf:linbit:drbd \
	params drbd_resource="drbd0" \
	op start timeout="240" op promote timeout="90" \
	op demote timeout="90" op stop timeout="100" \
	op monitor interval="10" timeout="20" start-delay="1min" \
	op notify timeout="90"
primitive resFS ocf:heartbeat:Filesystem \
	params device="/dev/drbd0" directory="/srv" fstype="gfs2" \
	op start  timeout="60" op stop  timeout="60" \
	op monitor interval="20" timeout="40" start-delay="0" \
	op notify timeout="60"
primitive resGFS2CTL ocf:pacemaker:controld \
	op monitor interval="30s" timeout="20s" \
	op start timeout="90s" op stop timeout="100s"
primitive resIP ocf:heartbeat:IPaddr2 \
	params ip="192.168.56.20" nic="eth1" cidr_netmask="24" iflabel="eth1" \
	op start timeout="20" op stop timeout="20" \
	op monitor interval="10" timeout="20" start-delay="0" \
	meta resource-stickiness="50"
primitive resPINGD ocf:pacemaker:pingd \
	params host_list="192.168.56.1" dampen="5s" multiplier="100" interval="2s" \
	op monitor interval="10s" timeout="20s" start-delay="30s" \
	op start timeout="90s" op stop timeout="100s"
group grpGFSMGMT resDLM resGFS2CTL
ms msDRBD resDRBD meta master-max="2" clone-max="2" notify="true"
clone cloneFS resFS 
clone cloneGFSMGMT grpGFSMGMT 
clone clonePINGD resPINGD
location locIP resIP rule $id="locIP-rule" pingd: defined pingd
colocation colDRBD inf: cloneGFSMGMT msDRBD:Master
colocation colFS inf: cloneFS cloneGFSMGMT
order ordFS inf: msDRBD:promote cloneGFSMGMT:start cloneFS:start
property $id="cib-bootstrap-options" \
	expected-quorum-votes="2" \
	dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
	no-quorum-policy="ignore" \
	cluster-infrastructure="openais" \
	stonith-enabled="false"
---
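To illustrate how the `pingd: defined pingd` rule and the resource stickiness of resIP interact, here is a simplified Python model. It is a sketch under assumptions, not how Pacemaker computes scores internally: it assumes pingd publishes multiplier × (number of reachable hosts) as the node attribute `pingd`, that the rule uses that value directly as the location score, and that resource-stickiness is added on the node currently running resIP. Dampening and all other constraints are ignored, and the function names are hypothetical.

```python
from typing import Dict, Optional

MULTIPLIER = 100  # multiplier="100" from resPINGD
STICKINESS = 50   # resource-stickiness="50" from resIP


def pingd_value(reachable_hosts: int, multiplier: int = MULTIPLIER) -> int:
    """Assumed value of the 'pingd' node attribute for one node."""
    return multiplier * reachable_hosts


def choose_node(reachable: Dict[str, int], current: Optional[str]) -> str:
    """Return the node with the highest combined score for resIP.

    reachable maps node name -> number of host_list entries it can ping.
    current is the node resIP runs on now (None if it is stopped).
    """
    scores: Dict[str, int] = {}
    for node, hosts in reachable.items():
        # the location rule turns the pingd attribute into a score
        score = pingd_value(hosts)
        if node == current:
            # stickiness favours leaving the IP where it already is
            score += STICKINESS
        scores[node] = score
    # max() returns the first node in insertion order on an exact tie
    return max(scores, key=lambda n: scores[n])


if __name__ == "__main__":
    # clusternode1 loses the gateway: 0 + 50 = 50 versus 100 on clusternode2
    print(choose_node({"clusternode1": 0, "clusternode2": 1}, current="clusternode1"))
```

With a single gateway and multiplier="100", a node that loses the gateway drops from 100 to 0, so its 50 points of stickiness are not enough to keep the IP and it moves. If both nodes lose the gateway, both fall to 0 and the stickiness keeps the IP where it is.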

Hope it's of value to you. The config is mainly taken from Michael
Schwartzkopff's (German) book, which I use to learn clustering. It only fails
when I unplug the crossover cable, because of an error in drbd, I think
(invoking 'drbdadm fence-peer minor-0' sometimes gives a kernel OOPS, but
that's another story, entirely related to drbd).

regards

Thomas