[Pacemaker] Resource does not come back to a node after node recovers from network issue

Prakash Velayutham prakash.velayutham at cchmc.org
Thu Jul 21 17:00:25 CET 2011


Hello all,

I have a two-node cluster running:

Corosync - 1.2.1
Pacemaker - 1.1.2

Both nodes have the primary (production) and private (heartbeat) networks bonded, each across two separate Ethernet interfaces: eth0/eth1 for bond0 (primary) and eth2/eth3 for bond1 (private). I am trying to test resource migration by bringing down the production bond.

I am seeing a strange issue as described below.

1. Assume the resources are currently hosted at node1.
2. If I do "ifdown bond0" on node1, I can see that the g_mysql-1 group resource migrates to node2.
3. If I then do "ifup bond0" on node1 followed by "ifdown bond0" on node2, the resources just get stopped and are not migrated back to node1.
4. In this state, they do start up successfully on node1 if I do a "cleanup resource" on the resource group.
5. The strange thing is that, at this point, if I do "ifup bond0" on node2 and "ifdown bond0" on node1, the resources do migrate successfully to node2.
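
For anyone wanting to reproduce this, the node1 side of steps 2 to 4 corresponds roughly to the sketch below. The `run` wrapper, the `DRY_RUN` variable, and the two function names are hypothetical helpers added only for illustration, so that the sequence can be printed without touching a live cluster. It also assumes the crm shell form `crm resource cleanup` for the cleanup in step 4 and `crm_mon -1 -f` (one-shot status including fail counts) for inspection; the exact commands used in the original test may have differed.

```shell
#!/bin/sh
# Sketch of the test sequence, as seen from node1.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: take the production bond down; the group should move to node2.
fail_production_bond() {
    run ifdown bond0
}

# Steps 3-4: bring the bond back, look at the fail counts, then clear the
# recorded start failure so the cluster may place the group here again.
recover_and_cleanup() {
    run ifup bond0
    run crm_mon -1 -f
    run crm resource cleanup "$1"
}
```

Calling `recover_and_cleanup g_mysql-1` in dry-run mode prints the three commands in order, which makes it easy to check the sequence before running it for real with `DRY_RUN=0`.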

Not sure what is going on. I can see the following in node1's /var/log/messages:

Jul 21 11:02:41 node1 crm_resource: [6731]: ERROR: unpack_rsc_op: Hard error - p_vip-1_start_0 failed with rc=2: Preventing p_vip-1 from re-starting on node1

Is this what is stopping the resources instead of migrating them to node1? Any idea what is going on?
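
For reference when reading that log line: in the OCF resource agent convention, exit code 2 is OCF_ERR_ARGS. The log itself says the failure is treated as a hard error that prevents p_vip-1 from restarting on node1, which fits the resources only coming back after a cleanup. The small sketch below maps the common OCF exit codes to their names and extracts the rc from a log line of this shape. The function names `ocf_rc_name` and `rc_from_log_line` are made up for this example and are not part of Pacemaker, and the sketch does not explain why the agent returned that code.

```shell
#!/bin/sh
# Map an OCF resource agent exit code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# Pull the numeric rc out of a "... failed with rc=N: ..." log line.
# Prints nothing if the line does not match.
rc_from_log_line() {
    printf '%s\n' "$1" | sed -n 's/.*failed with rc=\([0-9][0-9]*\).*/\1/p'
}
```

For example, `ocf_rc_name "$(rc_from_log_line "$line")"` with `$line` set to the message above yields the symbolic name for the failed start of p_vip-1.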

The crm config is here.

node node1
node node2
primitive p_dlm-1 ocf:pacemaker:controld \
	operations $id="p_dlm-1-operations" \
	op monitor interval="120" timeout="20" start-delay="0" \
	params daemon="dlm_controld.pcmk"
primitive p_mysql-1 ocf:heartbeat:mysql \
	operations $id="p_mysql-1-operations" \
	op monitor interval="10s" timeout="15s" start-delay="15" \
	params datadir="/var/lib/mysql/data1" socket="/var/lib/mysql/data1/mysql.sock" \
	meta target-role="started"
primitive p_ocfs2-1 ocf:heartbeat:Filesystem \
	operations $id="p_ocfs2-1-operations" \
	op monitor interval="20" timeout="40" \
	params device="/dev/mapper/mysql01" directory="/var/lib/mysql/data1" fstype="ocfs2" \
	meta target-role="started"
primitive p_ocfs2control-1 ocf:ocfs2:o2cb \
	operations $id="p_ocfs2control-1-operations" \
	op monitor interval="120" timeout="20" start-delay="0" \
	params stack="pcmk"
primitive p_vip-1 ocf:heartbeat:IPaddr2 \
	operations $id="p_vip-1-operations" \
	op monitor interval="60s" timeout="10s" \
	params ip="10.200.31.103" broadcast="10.200.31.255" cidr_netmask="255.255.255.0" \
	meta target-role="started"
primitive stonith-1 stonith:external/riloe \
	meta target-role="started" \
	operations $id="stonith-1-operations" \
	op monitor interval="600" timeout="60" start-delay="0" \
	params hostlist="node1" ilo_hostname="node1rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
primitive stonith-2 stonith:external/riloe \
	meta target-role="started" \
	operations $id="stonith-2-operations" \
	op monitor interval="600" timeout="60" start-delay="0" \
	params hostlist="node2" ilo_hostname="node2rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
group g_mysql-1 p_vip-1 p_mysql-1 \
	meta target-role="started"
clone c_dlm-1 p_dlm-1 \
	meta interleave="true" target-role="started"
clone c_ocfs2-1 p_ocfs2-1 \
	meta interleave="true" target-role="started"
clone c_ocfs2control-1 p_ocfs2control-1 \
	meta interleave="true" target-role="started"
location stonith-1-never-on-node1 stonith-1 -inf: node1
location stonith-2-never-on-node2 stonith-2 -inf: node2
colocation g_mysql-1-with-ocfs2-1 inf: g_mysql-1 c_ocfs2-1
colocation ocfs2-1-with-ocfs2control-1 inf: c_ocfs2-1 c_ocfs2control-1
colocation ocfs2control-1-with-dlm-1 inf: c_ocfs2control-1 c_dlm-1
order start-mysql-1-after-ocfs2-1 : c_ocfs2-1 g_mysql-1
order start-ocfs2-1-after-ocfs2control-1 : c_ocfs2control-1 c_ocfs2-1
order start-ocfs2control-1-after-dlm-1 : c_dlm-1 c_ocfs2control-1
property $id="cib-bootstrap-options" \
	dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	no-quorum-policy="ignore" \
	last-lrm-refresh="1311260565" \
	stonith-timeout="30s" \
	start-failure-is-fatal="false"

Thanks a ton,
Prakash

