[Pacemaker] Trouble setting up IP failover with ping resource

Fri Feb 17 11:26:44 UTC 2012

Hi,

On Thu, Feb 16, 2012 at 07:57:14PM -0800, Anlu Wang wrote:
> I have three machines named anlutest1, anlutest2, and anlutest3 that I'm
> trying to get IP failover working on. I'm using heartbeat for the messaging
> layer, and everything works great when a machine goes down. But I also
> would like to failover an IP when EITHER the eth0 or eth1 network
> interfaces fail. From reading
> 
> http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch09s03s03.html
> 
> it seems the right way to do this is to add a ping resource.
> 
> Here is my XML configuration:
> 
> http://pastebin.com/05z7eB2s

The configuration seems OK, though the monitors are evidently being
scheduled back-to-back (see the "postponing all ops" messages
below). I'd suggest increasing the monitor intervals or reducing
the dampen period. Which version of Pacemaker are you running? It
may also be worth taking a look at this thread:

http://oss.clusterlabs.org/pipermail/pacemaker/2011-April/009942.html

Thanks,

Dejan

> This config doesn't work for me. Using the showscores.sh script found at:
> 
> http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg00410.html
> 
> I see that my scores are:
> 
> Resource                       Score     Node      Stickiness #Fail      Migration-Threshold
> address01                      0         anlutest3 0          0
> address01                      1006      anlutest1 0          5
> address01                      50        anlutest2 0          157
> address02                      0         anlutest3 0          0
> address02                      1050      anlutest2 0          2
> address02                      6         anlutest1 0          0
> address03                      1000      anlutest3 0          7
> address03                      50        anlutest2 0
> address03                      6         anlutest1 0          0
> ping:0                         0         anlutest1 0          6
> ping:0                         0         anlutest2 0          14
> ping:0                         0         anlutest3 0          0
> ping:1                         0         anlutest2 0
> ping:1                         0         anlutest3 0          28
> ping:1                         -1000000  anlutest1 0          0
> ping:2                         0         anlutest3 0          13
> ping:2                         -1000000  anlutest1 0          0
> ping:2                         -1000000  anlutest2 0
> 
> 
> which make no sense at all. I don't see how I could be getting these scores
> of 50 and 1006. When I take down an interface on anlutest3, I see scores of
> 4 and 1004, which sort of make sense, just the multiplier of 100 isn't
> working. I was experimenting with changing values, so maybe it's caching old
> values. If so, how do I enforce the new values?
> 
> Furthermore, shouldn't there be no scores of 0? If all 6 IPs I am pinging
> return successfully, shouldn't my scores be either 600 or 1600?
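
The arithmetic behind that expectation can be written out as a small sketch. It assumes the ping agent sets the pingd attribute to the number of reachable hosts times the multiplier, and that a location rule adds that attribute to a fixed node preference. The preference of 1000 is inferred from the "600 or 1600" figures, not read from the pastebin config, and the function names are made up for illustration.

```python
def pingd_value(reachable_hosts: int, multiplier: int) -> int:
    """Attribute value: number of reachable ping targets times the multiplier."""
    return reachable_hosts * multiplier


def resource_score(reachable_hosts: int, multiplier: int, preference: int = 0) -> int:
    """Score when a location rule adds pingd to a fixed node preference.

    The preference is an assumption (1000 in the example below), taken from
    the "600 or 1600" expectation rather than from the actual configuration.
    """
    return preference + pingd_value(reachable_hosts, multiplier)


if __name__ == "__main__":
    # Six ping targets, all reachable, with and without the assumed preference.
    for multiplier in (100, 1):
        plain = resource_score(6, multiplier)
        preferred = resource_score(6, multiplier, preference=1000)
        print(f"multiplier={multiplier}: {plain} / {preferred}")
```

With multiplier 100 this gives 600 and 1600; with multiplier 1 it gives 6 and 1006, which happen to be the values in the table for anlutest1. That would fit an instance still running with multiplier=[1], as the lrmd log for ping:2 suggests, though this is only one possible reading of the numbers.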
> 
> In my syslog I also see a ton of messages like
> 
> Feb 17 03:54:47 anlutest2 lrmd: [1137]: info: perform_op:2877: operations on resource address01 already delayed
> Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[419] on ocf::ping::ping:1 for client 1140, its parameters: CRM_meta_clone=[1] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[10000] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[5000]  for rsc is already running.
> Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:1 by 1000 ms
> Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[171] on ocf::ping::ping:2 for client 1140, its parameters: CRM_meta_clone=[2] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[1] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[5000]  for rsc is already running.
> Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:2 by 1000 ms
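
To judge whether a change to the monitor interval or dampen period helps, it may be useful to count how often lrmd postpones operations per resource before and after the change. The following is a rough sketch that only relies on the message wording shown in the log above; the function name is invented, and it expects unwrapped syslog lines.

```python
import re
from collections import Counter

# Matches the lrmd message quoted above, e.g.
# "... postponing all ops on resource ping:1 by 1000 ms"
POSTPONED = re.compile(r"postponing\s+all ops on resource (\S+) by (\d+) ms")


def tally_postponed(lines):
    """Count 'postponing all ops' messages per resource name."""
    counts = Counter()
    for line in lines:
        match = POSTPONED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Feb 17 03:54:47 anlutest2 lrmd: [1137]: info: perform_op:2877: "
        "operations on resource address01 already delayed",
        "Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: "
        "postponing all ops on resource ping:1 by 1000 ms",
        "Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: "
        "postponing all ops on resource ping:2 by 1000 ms",
    ]
    for resource, count in sorted(tally_postponed(sample).items()):
        print(resource, count)
```

On a real system the lines would come from the syslog file instead of the sample list; a falling count after raising the interval would suggest the monitors are no longer piling up.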
> 
> and occasionally
> 
> Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (4000)
> Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_ha_callback: flush message from anlutest2
> Feb 17 03:54:33 anlutest2 attrd: [1139]: WARN: find_nvpair_attr: Multiple attributes match name=pingd
> Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr:   Value: 50 #011(id=status-d619a94e-ebba-4ed0-8e0f-89837dd7506b-pingd)
> Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr:   Value: 3 #011(id=status-ab3c1a25-9471-48f7-9c0b-c76238abd402-pingd)
> Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_perform_update: Sent update -40: pingd=4000
> Feb 17 03:54:33 anlutest2 attrd: [1139]: ERROR: attrd_cib_callback: Update -40 for pingd=4000 failed: Required data for this CIB API call not found
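
To see which status entries the "Multiple attributes match name=pingd" warning refers to, one could list every pingd nvpair in the status section of the CIB. The sketch below is only an illustration: the XML fragment is a hypothetical reconstruction built from the two ids in the log above, laid out the way the status section is usually structured, and the function name is invented. The real status section would have to be taken from the live CIB (for instance from the output of `cibadmin -Q`).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the ids and values are taken from the log above.
STATUS_FRAGMENT = """
<status>
  <node_state id="d619a94e-ebba-4ed0-8e0f-89837dd7506b">
    <transient_attributes id="d619a94e-ebba-4ed0-8e0f-89837dd7506b">
      <instance_attributes id="status-d619a94e-ebba-4ed0-8e0f-89837dd7506b">
        <nvpair id="status-d619a94e-ebba-4ed0-8e0f-89837dd7506b-pingd"
                name="pingd" value="50"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
  <node_state id="ab3c1a25-9471-48f7-9c0b-c76238abd402">
    <transient_attributes id="ab3c1a25-9471-48f7-9c0b-c76238abd402">
      <instance_attributes id="status-ab3c1a25-9471-48f7-9c0b-c76238abd402">
        <nvpair id="status-ab3c1a25-9471-48f7-9c0b-c76238abd402-pingd"
                name="pingd" value="3"/>
      </instance_attributes>
    </transient_attributes>
  </node_state>
</status>
"""


def find_attribute(status_xml, name):
    """Return (node_state id, value) for every nvpair with the given name."""
    root = ET.fromstring(status_xml.strip())
    matches = []
    for node_state in root.iter("node_state"):
        for nvpair in node_state.iter("nvpair"):
            if nvpair.get("name") == name:
                matches.append((node_state.get("id"), nvpair.get("value")))
    return matches


if __name__ == "__main__":
    for node_id, value in find_attribute(STATUS_FRAGMENT, "pingd"):
        print(node_id, value)
```

Neither value listed this way (50 and 3) equals the 4000 that attrd was trying to write, which may point to leftover attributes from earlier experiments with the multiplier.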
> 
> Could someone just take a look at my config and let me know what I'm doing
> wrong? Or if there's a better way to do what I want to do...
> 
> Thanks in advance,
> Anlu
