[Pacemaker] Resource does not come back to a node after node recovers from network issue
Prakash Velayutham
prakash.velayutham at cchmc.org
Sat Jul 23 15:23:57 CET 2011
On Jul 21, 2011, at 12:00 PM, Prakash Velayutham wrote:
> Hello all,
>
> I have a 2 node cluster running
>
> Corosync - 1.2.1
> Pacemaker - 1.1.2
>
> Both nodes have the primary (production) and private (heartbeat) networks bonded across two separate Ethernet interfaces each: eth0/eth1 for bond0 (primary) and eth2/eth3 for bond1 (private). I am trying to test the migration of resources by downing the production bond.
>
> I am seeing a strange issue as described below.
>
> 1. Assume the resources are currently hosted at node1.
> 2. If I do "ifdown bond0", I can see that the g_mysql-1 group resource migrates to node2.
> 3. If I do "ifup bond0" on node1 and then do an "ifdown bond0" on node2, the resources just get stopped, but are not migrated back to node1.
> 4. They do start up successfully on node1 if I do a "cleanup resource" on the resource group at this point.
> 5. The strange thing is, at this point, if I do a "ifup bond0" on node2 and "ifdown bond0" on node1, the resources do migrate successfully to node2.
>
> Not sure what is going on. I can see the following in node1's /var/log/messages:
>
> Jul 21 11:02:41 node1 crm_resource: [6731]: ERROR: unpack_rsc_op: Hard error - p_vip-1_start_0 failed with rc=2: Preventing p_vip-1 from re-starting on node1
>
> Is this what is stopping the resources and preventing them from migrating back to node1? Any idea what is going on?
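For reference, that failed/hard-error state can be inspected and cleared from the crm shell. A minimal sketch using the resource and node names from the config below (exact option spellings can vary slightly between Pacemaker 1.1 releases):

    # one-shot cluster status including per-resource fail counts
    crm_mon -1 --failcounts
    # show the recorded fail count for the failed primitive on node1
    crm resource failcount p_vip-1 show node1
    # clear the failure history so node1 becomes eligible for the group again
    crm resource cleanup g_mysql-1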
>
> The crm config is here.
>
> node node1
> node node2
> primitive p_dlm-1 ocf:pacemaker:controld \
> operations $id="p_dlm-1-operations" \
> op monitor interval="120" timeout="20" start-delay="0" \
> params daemon="dlm_controld.pcmk"
> primitive p_mysql-1 ocf:heartbeat:mysql \
> operations $id="p_mysql-1-operations" \
> op monitor interval="10s" timeout="15s" start-delay="15" \
> params datadir="/var/lib/mysql/data1" socket="/var/lib/mysql/data1/mysql.sock" \
> meta target-role="started"
> primitive p_ocfs2-1 ocf:heartbeat:Filesystem \
> operations $id="p_ocfs2-1-operations" \
> op monitor interval="20" timeout="40" \
> params device="/dev/mapper/mysql01" directory="/var/lib/mysql/data1" fstype="ocfs2" \
> meta target-role="started"
> primitive p_ocfs2control-1 ocf:ocfs2:o2cb \
> operations $id="p_ocfs2control-1-operations" \
> op monitor interval="120" timeout="20" start-delay="0" \
> params stack="pcmk"
> primitive p_vip-1 ocf:heartbeat:IPaddr2 \
> operations $id="p_vip-1-operations" \
> op monitor interval="60s" timeout="10s" \
> params ip="10.200.31.103" broadcast="10.200.31.255" cidr_netmask="255.255.255.0" \
> meta target-role="started"
> primitive stonith-1 stonith:external/riloe \
> meta target-role="started" \
> operations $id="stonith-1-operations" \
> op monitor interval="600" timeout="60" start-delay="0" \
> params hostlist="node1" ilo_hostname="node1rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
> primitive stonith-2 stonith:external/riloe \
> meta target-role="started" \
> operations $id="stonith-2-operations" \
> op monitor interval="600" timeout="60" start-delay="0" \
> params hostlist="node2" ilo_hostname="node2rilo.chmcres.cchmc.org" ilo_user="xxxx" ilo_password="xxxx" ilo_can_reset="1" ilo_protocol="2.0" ilo_powerdown_method="power"
> group g_mysql-1 p_vip-1 p_mysql-1 \
> meta target-role="started"
> clone c_dlm-1 p_dlm-1 \
> meta interleave="true" target-role="started"
> clone c_ocfs2-1 p_ocfs2-1 \
> meta interleave="true" target-role="started"
> clone c_ocfs2control-1 p_ocfs2control-1 \
> meta interleave="true" target-role="started"
> location stonith-1-never-on-node1 stonith-1 -inf: node1
> location stonith-2-never-on-node2 stonith-2 -inf: node2
> colocation g_mysql-1-with-ocfs2-1 inf: g_mysql-1 c_ocfs2-1
> colocation ocfs2-1-with-ocfs2control-1 inf: c_ocfs2-1 c_ocfs2control-1
> colocation ocfs2control-1-with-dlm-1 inf: c_ocfs2control-1 c_dlm-1
> order start-mysql-1-after-ocfs2-1 : c_ocfs2-1 g_mysql-1
> order start-ocfs2-1-after-ocfs2control-1 : c_ocfs2control-1 c_ocfs2-1
> order start-ocfs2control-1-after-dlm-1 : c_dlm-1 c_ocfs2control-1
> property $id="cib-bootstrap-options" \
> dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1311260565" \
> stonith-timeout="30s" \
> start-failure-is-fatal="false"
>
> Thanks a ton,
> Prakash
Hi,
An update: this issue was resolved once I started using an ocf:pacemaker:ping clone resource, per http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch09s03s03.html. Is this the right way to go?
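For reference, a minimal sketch of the kind of connectivity check that chapter describes: a ping clone plus a location rule that keeps g_mysql-1 off nodes that have lost network connectivity. The ping target 10.200.31.1 is only a placeholder for whatever address the production network should be able to reach, and pingd is the attribute name the ping agent writes by default:

    primitive p_ping ocf:pacemaker:ping \
            params host_list="10.200.31.1" multiplier="1000" \
            op monitor interval="15s" timeout="60s"
    clone c_ping p_ping \
            meta interleave="true"
    location l_mysql-1-needs-connectivity g_mysql-1 \
            rule -inf: not_defined pingd or pingd lte 0

With a rule like that in place, a node whose bond0 loses connectivity has its pingd attribute drop to 0, so the group is moved to the node that can still reach the ping target instead of simply being stopped.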
Thanks,
Prakash