[Pacemaker] Service failback issue with SLES11 and HAE 11

Dejan Muhamedagic dejanmm at fastmail.fm
Tue Jun 8 12:26:48 EDT 2010


Hi,

On Tue, Jun 08, 2010 at 10:00:37AM +0800, ben180 wrote:
> Dear all,
> 
> There are two nodes in my customer's environment. We installed SuSE
> Linux Enterprise Server 11 and the HAE on both nodes. The cluster is
> for Oracle database service HA.
> We set up a clone resource for pingd, with constraints to detect
> whether the network interface is down, and added a resource group
> containing Filesystem, IPaddr, oracle, and oralsnr primitives. We use
> sbd as the fencing device.
> 
> Here are our settings:
> 
> <resources>
>   <clone id="Connected">
>     <meta_attributes id="Connected-meta_attributes">
>       <nvpair id="nvpair-c25cd67c-f681-4652-8007-64c0e50fabe6" name="clone-max" value="2"/>
>       <nvpair id="nvpair-28b5d065-7984-424d-b5a8-fb8fc7b2f6dc" name="target-role" value="Started"/>
>     </meta_attributes>
>     <primitive class="ocf" id="ping" provider="pacemaker" type="pingd">
>       <meta_attributes id="ping-meta_attributes">
>         <nvpair id="nvpair-e7f1cec6-f5a7-4db2-b20b-0002ad31d9fa" name="target-role" value="Started"/>
>       </meta_attributes>
>       <operations id="ping-operations">
>         <op id="ping-op-monitor-10" interval="10" name="monitor" start-delay="1m" timeout="20"/>
>       </operations>
>       <instance_attributes id="ping-instance_attributes">
>         <nvpair id="nvpair-cab0ee56-cd6c-47b7-b206-a19bce16d445" name="dampen" value="5"/>
>         <nvpair id="nvpair-c660e572-40ba-4166-9293-6e99f5d024e8" name="host_list" value="10.224.1.254"/>
>       </instance_attributes>
>     </primitive>
>   </clone>
>   <group id="DB">
>     <meta_attributes id="DB-meta_attributes">
>       <nvpair id="nvpair-a0ce4033-555a-40c3-8e92-191552596a97" name="target-role" value="Started"/>
>     </meta_attributes>
>     <primitive class="ocf" id="FileSystem" provider="heartbeat" type="Filesystem">
>       <meta_attributes id="FileSystem-meta_attributes">
>         <nvpair id="nvpair-6e46d65a-86d4-41d4-9c7f-7ea502ca9f36" name="target-role" value="started"/>
>       </meta_attributes>
>       <operations id="FileSystem-operations">
>         <op id="FileSystem-op-monitor-20" interval="20" name="monitor" start-delay="10" timeout="40"/>
>       </operations>
>       <instance_attributes id="FileSystem-instance_attributes">
>         <nvpair id="nvpair-99da66a3-ebdf-4c3b-9647-05a065ff8309" name="device" value="/dev/dm-0"/>
>         <nvpair id="nvpair-c882d532-3fc5-41a4-b1a3-6b03b2b3d54d" name="directory" value="/oracle"/>
>         <nvpair id="nvpair-643ad766-eb95-4667-8b33-452f8266ba10" name="fstype" value="ext3"/>
>       </instance_attributes>
>     </primitive>
>     <primitive class="ocf" id="ServiceIP" provider="heartbeat" type="IPaddr">
>       <meta_attributes id="ServiceIP-meta_attributes">
>         <nvpair id="nvpair-03afe5cc-226f-43db-b1e5-ee2f5f1cb66e" name="target-role" value="Started"/>
>       </meta_attributes>
>       <operations id="ServiceIP-operations">
>         <op id="ServiceIP-op-monitor-5s" interval="5s" name="monitor" start-delay="1s" timeout="20s"/>
>       </operations>
>       <instance_attributes id="ServiceIP-instance_attributes">
>         <nvpair id="nvpair-10b45737-aa05-4a7f-9469-b1f75e138834" name="ip" value="10.224.1.138"/>
>       </instance_attributes>
>     </primitive>
>     <primitive class="ocf" id="Instance" provider="heartbeat" type="oracle">
>       <meta_attributes id="Instance-meta_attributes">
>         <nvpair id="nvpair-2bbbe865-1339-4cbf-8add-8aa107736260" name="target-role" value="Started"/>
>         <nvpair id="nvpair-6cfd1675-ce77-4c05-8031-242ed176b890" name="failure-timeout" value="1"/>
>         <nvpair id="nvpair-e957ff0a-c40e-494d-b691-02d3ea67440b" name="migration-threshold" value="1"/>
>         <nvpair id="nvpair-4283547a-4c34-4b26-b82f-50730dc4c4fa" name="resource-stickiness" value="INFINITY"/>
>         <nvpair id="nvpair-9478fb7c-e1aa-405e-b4a9-c031971bc612" name="is-managed" value="true"/>
>       </meta_attributes>
>       <operations id="Instance-operations">
>         <op enabled="true" id="Instance-op-monitor-120" interval="30" name="monitor" role="Started" start-delay="1m" timeout="240"/>
>       </operations>
>       <instance_attributes id="Instance-instance_attributes">
>         <nvpair id="nvpair-30288e1c-e9e9-4360-b658-045f2f353704" name="sid" value="BpmDBp"/>
>       </instance_attributes>
>     </primitive>
>     <primitive class="ocf" id="Listener" provider="heartbeat" type="oralsnr">
>       <meta_attributes id="Listener-meta_attributes">
>         <nvpair id="nvpair-f6219b53-5d6a-42cb-8dec-d8a17b0c240c" name="target-role" value="Started"/>
>         <nvpair id="nvpair-ae38b2bd-b3ee-4a5a-baec-0a998ca7742d" name="failure-timeout" value="1"/>
>       </meta_attributes>
>       <operations id="Listener-operations">
>         <op id="Listener-op-monitor-10" interval="10" name="monitor" start-delay="10" timeout="30"/>
>       </operations>
>       <instance_attributes id="Listener-instance_attributes">
>         <nvpair id="nvpair-96615bed-b8a1-4385-a61c-0f399225e63e" name="sid" value="BpmDBp"/>
>       </instance_attributes>
>     </primitive>
>   </group>
>   <clone id="Fence">
>     <meta_attributes id="Fence-meta_attributes">
>       <nvpair id="nvpair-59471c10-ec9d-4eb8-becc-ef6d91115614" name="clone-max" value="2"/>
>       <nvpair id="nvpair-6af8cf4a-d96f-449b-9625-09a10c206a5f" name="target-role" value="Started"/>
>     </meta_attributes>
>     <primitive class="stonith" id="sbd-stonith" type="external/sbd">
>       <meta_attributes id="sbd-stonith-meta_attributes">
>         <nvpair id="nvpair-8c7cfdb7-ec19-4fcc-a7f8-39a9957326d2" name="target-role" value="Started"/>
>       </meta_attributes>
>       <operations id="sbd-stonith-operations">
>         <op id="sbd-stonith-op-monitor-15" interval="15" name="monitor" start-delay="15" timeout="15"/>
>       </operations>
>       <instance_attributes id="sbd-stonith-instance_attributes">
>         <nvpair id="nvpair-724e62b3-0b26-4778-807e-a457dbd2fe42" name="sbd_device" value="/dev/dm-1"/>
>       </instance_attributes>
>     </primitive>
>   </clone>
> </resources>

> After some failover tests, we found something strange. Say the oracle
> service is running on node1. First, we pull out node1's network
> cable; node1 is fenced by node2 and reboots. Second, the oracle DB
> service fails over to node2, which is what we expected. But third,
> after node1 boots up and its network comes back, node2 is fenced by
> node1, and the oracle DB service finally fails back to node1, which
> is NOT what we want.

Yes, I can imagine that.

> We found that after node1 reboots, it apparently cannot communicate
> with node2 via TOTEM, so node1 uses sbd to fence node2 and takes the
> resources back. Is there something wrong with my settings?

Can't say for sure because cluster properties are missing.
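If you can post them, something like this should capture both the
cluster properties and the full configuration (just a sketch; both
commands ship with SLES11 HAE, and the output paths are only examples):

  cibadmin -Q > /tmp/cib.xml             # full CIB, including crm_config properties
  crm configure show > /tmp/config.txt   # human-readable dump from the crm shell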

> Or could someone give me some advice about this situation?
> 
> I've attached the logs from both nodes and the pacemaker settings.
> If you can, please pay attention to node1's log below (tibcodb is
> node1 and tibcodb2 is node2):
> 
> Jun  1 11:30:39 tibcodb openais[5018]: [TOTEM] Token Timeout (5000 ms)
> retransmit timeout (490 ms)
> Jun  1 11:30:39 tibcodb openais[5018]: [TOTEM] The network interface
> [10.224.1.89] is now up.
> .......................................................
> ......................................................
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ] CLM CONFIGURATION CHANGE
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ] New Configuration:
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ]    r(0) ip(10.224.1.89)
> <======================== Why is node2 (tibcodb2: 10.224.1.90) not
> recognized, and only node1?
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ] Members Left:
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ] Members Joined:
> Jun  1 11:30:39 tibcodb openais[5018]: [CLM  ]    r(0) ip(10.224.1.89)
> Jun  1 11:30:39 tibcodb openais[5018]: [crm  ] notice:
> pcmk_peer_update: Stable membership event on ring 104: memb=1, new=1,
> lost=0
> Jun  1 11:30:39 tibcodb openais[5018]: [crm  ] info: pcmk_peer_update:
> NEW:  tibcodb 1
> Jun  1 11:30:39 tibcodb openais[5018]: [crm  ] info: pcmk_peer_update:
> MEMB: tibcodb 1
> Jun  1 11:30:39 tibcodb openais[5018]: [MAIN ] info: update_member:
> Node tibcodb now has process list: 00000000000000000000000000053312
> (340754)
> Jun  1 11:30:39 tibcodb openais[5018]: [SYNC ] This node is within the
> primary component and will provide service.
> ........................................
> .........................................
> Jun  1 11:30:57 tibcodb stonithd: [5026]: info: client tengine [pid:
> 5031] requests a STONITH operation RESET on node tibcodb2 <==== Why
> does node1 want to fence node2?

Don't know, that part of the logs is missing.
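The decision should be visible in the crmd/pengine messages shortly
before the stonithd line you quoted; a rough way to pull them out
(assuming the default syslog destination on SLES11):

  grep -E 'crmd|pengine|tengine|stonithd' /var/log/messages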

> Jun  1 11:30:57 tibcodb stonithd: [5026]: info:
> stonith_operate_locally::2683: sending fencing op RESET for tibcodb2
> to sbd-stonith:0 (external/sbd) (pid=5398)
> Jun  1 11:30:57 tibcodb sbd: [5400]: info: tibcodb2 owns slot 1
> Jun  1 11:30:57 tibcodb sbd: [5400]: info: Writing reset to node slot tibcodb2
> Jun  1 11:31:07 tibcodb sbd: [5400]: info: reset successfully
> delivered to tibcodb2
> 
> 
> Please, any help would be appreciated.
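
You can also check what sbd itself recorded on the shared device
during the incident; a rough sketch, using the /dev/dm-1 device from
your configuration:

  sbd -d /dev/dm-1 list   # show each node's slot and any pending message
  sbd -d /dev/dm-1 dump   # show the device header and timeout settings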

The most likely reason is that something was wrong with the network.
But it's really hard to say without the full logs and configuration.
You can prepare an hb_report. Since this is SLES, the best option
would be to open a call with your support representative.
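
Something along these lines should collect everything from around the
incident (adjust the times to match your log; hb_report produces a
tarball you can attach here or to the support call):

  hb_report -f "2010/06/01 11:00" -t "2010/06/01 12:00" /tmp/failback-incident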

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



