[Pacemaker] Problem configuring Heartbeat with CRM : Abnormal Failover test results

Deneux Olivier odeneux at oxya.com
Fri Jul 29 14:14:31 CET 2011


Hello,

First of all, please excuse my approximate English...
I'm facing a problem configuring a simple 2-node cluster with one
resource (a virtual IP).
I've read a lot of threads and documentation but haven't found the
answer. I have to say the world of clustering is pretty new to me...

I've installed the following packages on my two Linux servers
(RHEL 4.1.2-48):

cluster-glue-1.0.5-1.el5.x86_64.rpm
cluster-glue-libs-1.0.5-1.el5.x86_64.rpm
corosync-1.2.5-1.3.el5.x86_64.rpm
corosynclib-1.2.5-1.3.el5.x86_64.rpm
heartbeat-3.0.3-2.el5.x86_64.rpm
heartbeat-libs-3.0.3-2.el5.x86_64.rpm
libesmtp-1.0.4-5.el5.x86_64.rpm
pacemaker-1.0.9.1-1.el5.x86_64.rpm
pacemaker-libs-1.0.9.1-1.el5.x86_64.rpm
resource-agents-1.0.3-2.el5.x86_64.rpm

(corosync is not running; it seems I don't need it)

Below is the ha.cf of node 1:
autojoin none
keepalive 2
deadtime 10
initdead 80
udpport 694
ucast bond0 <@IP node2>
auto_failback off
node    node1
node    node2
use_logd yes
crm     yes

Below is the ha.cf of node 2:
autojoin none
keepalive 2
deadtime 10
initdead 80
udpport 694
ucast bond0 <@IP node1>
auto_failback off
node    node1
node    node2
use_logd yes
crm     yes
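
(To rule out a communication problem between the nodes, the way I check
that heartbeat is actually listening on the ucast port on each node is
roughly this:)

# heartbeat should be bound to UDP port 694 (the udpport from ha.cf)
netstat -ulnp | grep 694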

I used the crm shell to configure the cluster; here is the resulting
cib.xml:

<cib validate-with="pacemaker-1.0" crm_feature_set="3.0.1"
have-quorum="1" admin_epoch="0" epoch="190"
dc-uuid="85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4"
num_updates="0" cib-last-written="Fri Jul 29 14:18:28 2011">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version" 
value="1.0.9-89bd754939df5150de7cd76835f98fe90851b677"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure" 
name="cluster-infrastructure" value="Heartbeat"/>
<nvpair id="cib-bootstrap-options-stonith-enabled" 
name="stonith-enabled" value="false"/>
<nvpair id="cib-bootstrap-options-last-lrm-refresh" 
name="last-lrm-refresh" value="1311941556"/>
</cluster_property_set>
</crm_config>
<nodes>
<node type="normal" uname="node2" id="85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4">
<instance_attributes id="nodes-85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4">
<nvpair name="standby" 
id="nodes-85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4-standby" value="off"/>
</instance_attributes>
</node>
<node id="813121d2-360b-4532-8883-7f1330ed2c39" type="normal" uname="node1">
<instance_attributes id="nodes-813121d2-360b-4532-8883-7f1330ed2c39">
<nvpair id="nodes-813121d2-360b-4532-8883-7f1330ed2c39-standby" 
name="standby" value="off"/>
</instance_attributes>
</node>
</nodes>
<resources>
<primitive class="ocf" id="ClusterIP" provider="heartbeat" type="IPaddr2">
<instance_attributes id="ClusterIP-instance_attributes">
<nvpair id="ClusterIP-instance_attributes-ip" name="ip"
value="<@IP Virtual>"/>
<nvpair id="ClusterIP-instance_attributes-cidr_netmask" 
name="cidr_netmask" value="32"/>
</instance_attributes>
<operations>
<op id="ClusterIP-monitor-30s" interval="30s" name="monitor"/>
</operations>
<meta_attributes id="ClusterIP-meta_attributes">
<nvpair id="ClusterIP-meta_attributes-target-role" name="target-role" 
value="Started"/>
</meta_attributes>
</primitive>
</resources>
<constraints/>
<rsc_defaults/>
<op_defaults/>
</configuration>
</cib>
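
(For reference, these are essentially the crm shell commands I used to
get that configuration; <@IP Virtual> is the same placeholder as above:)

# disable fencing (no STONITH device yet) and define the virtual IP
crm configure property stonith-enabled=false
crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
    params ip="<@IP Virtual>" cidr_netmask=32 \
    op monitor interval=30s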

The Heartbeat daemon starts fine on both sides; here is the output of
crm_mon:
============
Last updated: Fri Jul 29 14:49:47 2011
Stack: Heartbeat
Current DC: node1 (813121d2-360b-4532-8883-7f1330ed2c39) - partition with
  quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ node2 node1 ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started node2


To test that everything works, I launch a script on node2 that stops
the network, waits 50 s, then brings the network back up; essentially
the sketch below (assuming the stock RHEL network service).
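
#!/bin/sh
# failover test, run on node2 (assumes the standard RHEL network service)
service network stop
sleep 50
service network start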
When the network goes down on node2, the resource migrates to node1 as
expected.
But when the network is back up, the resource does not move back to
node2 (it should, since no stickiness option is defined yet).
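
(If I *wanted* the resource to stay put after a failover, I understand
I would have to set stickiness explicitly, which I have not done, e.g.:)

# NOT configured on my cluster -- shown only to illustrate "stickiness"
crm configure rsc_defaults resource-stickiness=100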
Instead, I get the following error in crm_mon:

============
Last updated: Fri Jul 29 14:52:15 2011
Stack: Heartbeat
Current DC: node2 (85f5f8dc-6ccf-4478-8a89-a3d7c952c0e4) - partition with
  quorum
Version: 1.0.9-89bd754939df5150de7cd76835f98fe90851b677
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ node1 node2 ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started node1

Failed actions:
     ClusterIP_start_0 (node=node2, call=6, rc=2, status=complete): 
invalid parameter

The behaviour is the same if I move the resource to node1 and
stop/start the network on node1.

Why am I getting this "invalid parameter"?
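
From what I read, rc=2 is the OCF "invalid parameter" exit code, so it
looks like the IPaddr2 agent itself is rejecting its parameters on
node2. I suppose I could try to reproduce it by running the agent by
hand on node2, along these lines (untested sketch, same placeholder):

# invoke the IPaddr2 agent directly with the parameters from the CIB
OCF_ROOT=/usr/lib/ocf \
OCF_RESKEY_ip="<@IP Virtual>" \
OCF_RESKEY_cidr_netmask=32 \
/usr/lib/ocf/resource.d/heartbeat/IPaddr2 start
echo $?    # 2 again would confirm "invalid parameter" from the agent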

Here is an extract of the ha-log:

Jul 29 14:48:30 node2 pengine: [30752]: ERROR: unpack_rsc_op: Hard error 
- ClusterIP_start_0 failed with rc=2: Preventing ClusterIP from 
re-starting on node2
Jul 29 14:48:30 node2 pengine: [30752]: WARN: unpack_rsc_op: Processing 
failed op ClusterIP_start_0 on node2: invalid parameter (2)
Jul 29 14:48:30 node2 pengine: [30752]: notice: native_print: 
ClusterIP     (ocf::heartbeat:IPaddr2):       Started node1
Jul 29 14:48:30 node2 pengine: [30752]: info: get_failcount: ClusterIP 
has failed INFINITY times on node2
Jul 29 14:48:30 node2 pengine: [30752]: WARN: common_apply_stickiness: 
Forcing ClusterIP away from node2 after 1000000 failures (max=1000000)
Jul 29 14:48:30 node2 pengine: [30752]: notice: LogActions: Leave 
resource ClusterIP        (Started node1)
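
(I gather the INFINITY failcount is why the resource is kept away from
node2; if I understand correctly it can be inspected and reset with
something like the commands below, but that would only hide the failed
start, not explain it:)

# show, then clear, the failcount for ClusterIP on node2
crm_failcount -G -U node2 -r ClusterIP
crm resource cleanup ClusterIP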

If you need more info, please ask me!

Thanks in advance

Olivier



