[Pacemaker] Trouble setting up IP failover with ping resource

Fri Feb 17 20:09:20 CET 2012

Also, in the constraints section, the rsc_location definitions for the
ping connectivity resource do not specify a node attribute. What is the
default value of node in that case?
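
For context on this question: Pacemaker Explained describes two shapes of
rsc_location. One names a node and a score directly. The other carries a
rule and has no node attribute, so there is no default node; the rule is
evaluated on every node and the constraint applies wherever it matches.
The fragment below is only a sketch of both shapes. It assumes the
resource name address01 and the node name anlutest1 from this thread, and
that the ping resource writes its result to the attribute pingd (the name
seen in the attrd log lines). The score of 1000 and all constraint, rule,
and expression ids are made up for illustration.

```xml
<constraints>
  <!-- Node form: node and score are given explicitly.
       The score of 1000 is an assumed preference, not taken from the real config. -->
  <rsc_location id="address01-prefer-anlutest1" rsc="address01" node="anlutest1" score="1000"/>

  <!-- Rule form: no node attribute. The rule is evaluated on each node,
       and score-attribute adds that node's own pingd value to its score. -->
  <rsc_location id="address01-connectivity" rsc="address01">
    <rule id="address01-connectivity-rule" score-attribute="pingd">
      <expression id="address01-connectivity-expr" attribute="pingd" operation="defined"/>
    </rule>
  </rsc_location>
</constraints>
```

Under these assumptions, a node that reaches all six hosts with a
multiplier of 100 would be expected to carry pingd=600, which gives 600 on
an ordinary node and 1600 on the preferred one.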

Anlu

On Fri, Feb 17, 2012 at 10:57 AM, Anlu Wang <anlu at mixpanel.com> wrote:

> I'm running 1.0.8. In accordance with the bug in the post you linked, I
> changed the config so that interval is greater than dampen. Here is the
> relevant section now:
>
>       <clone id="connectivity_resource">
>         <primitive class="ocf" id="ping" provider="pacemaker" type="ping">
>           <instance_attributes id="ping-attrs">
>             <nvpair id="pingd-dampen" name="dampen" value="5s"/>
>             <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
>             <nvpair id="pingd-hosts" name="host_list" value="10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.97.196.102"/>
>           </instance_attributes>
>           <operations>
>             <op id="ping-monitor-10s" interval="10s" name="monitor" timeout="60s"/>
>           </operations>
>         </primitive>
>         <meta_attributes id="connectivity_resource-meta_attributes">
>           <nvpair id="connectivity_resource-meta_attributes-target-role" name="target-role" value="Started"/>
>         </meta_attributes>
>       </clone>
>
> The scores are still not what I expect however, and when I disable the
> internal interface on a node, nothing happens with failover.
>
> Also, I've noticed this in my syslog:
>
> Feb 17 06:26:11 anlutest2 lrmd: [1137]: WARN: ping:1:monitor process (PID 9380) timed out (try 1).  Killing with signal SIGTERM (15).
> Feb 17 06:26:11 anlutest2 lrmd: [1137]: info: RA output: (ping:1:monitor:stderr) Terminated
> Feb 17 06:26:11 anlutest2 ping[9380]: [15745]: INFO: They use TERM to bring us down. No such luck.
> Feb 17 06:26:11 anlutest2 ping[9380]: [15747]: ERROR: Unexpected result for 'ping -n -q -W 3 -c 5  50.97.196.103' 143:
>
> So it looks like the ping command is failing for some reason, but when I
> run it manually, it succeeds...
>
> Really at a loss here, any help is appreciated!
>
> Anlu
>
> On Fri, Feb 17, 2012 at 3:26 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
>
>> Hi,
>>
>> On Thu, Feb 16, 2012 at 07:57:14PM -0800, Anlu Wang wrote:
>> > I have three machines named anlutest1, anlutest2, and anlutest3 that I'm
>> > trying to get IP failover working on. I'm using heartbeat for the messaging
>> > layer, and everything works great when a machine goes down. But I would also
>> > like to fail over an IP when EITHER the eth0 or the eth1 network
>> > interface fails. From reading
>> >
>> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch09s03s03.html
>> >
>> > it seems the right way to do this is to add a ping resource.
>> >
>> > Here is my XML configuration:
>> >
>> > http://pastebin.com/05z7eB2s
>>
>> The configuration seems OK, though obviously monitors are
>> scheduled back-to-back (the postponed operations messages below).
>> I guess that you should increase the intervals or reduce the
>> dampen period. Which version of Pacemaker do you run? Perhaps
>> also take a look at this thread:
>>
>> http://oss.clusterlabs.org/pipermail/pacemaker/2011-April/009942.html
>>
>> Thanks,
>>
>> Dejan
>>
>> > This config doesn't work for me. Using the showscores.sh script found at:
>> >
>> > http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg00410.html
>> >
>> > I see that my scores are:
>> >
>> > Resource                       Score     Node      Stickiness #Fail Migration-Threshold
>> > address01                      0         anlutest3 0          0
>> > address01                      1006      anlutest1 0          5
>> > address01                      50        anlutest2 0          157
>> > address02                      0         anlutest3 0          0
>> > address02                      1050      anlutest2 0          2
>> > address02                      6         anlutest1 0          0
>> > address03                      1000      anlutest3 0          7
>> > address03                      50        anlutest2 0
>> > address03                      6         anlutest1 0          0
>> > ping:0                         0         anlutest1 0          6
>> > ping:0                         0         anlutest2 0          14
>> > ping:0                         0         anlutest3 0          0
>> > ping:1                         0         anlutest2 0
>> > ping:1                         0         anlutest3 0          28
>> > ping:1                         -1000000  anlutest1 0          0
>> > ping:2                         0         anlutest3 0          13
>> > ping:2                         -1000000  anlutest1 0          0
>> > ping:2                         -1000000  anlutest2 0
>> >
>> > which make no sense at all. I don't see how I could be getting these scores
>> > of 50 and 1006. When I take down an interface on anlutest3, I see scores of
>> > 4 and 1004, which sort of make sense, except that the multiplier of 100
>> > isn't being applied. I was experimenting with changing values, so maybe it's
>> > caching old values. If so, how do I enforce the new values?
>> >
>> > Furthermore, shouldn't there be no scores of 0? If all 6 IPs I am pinging
>> > return successfully, shouldn't my scores be either 600 or 1600?
>> >
>> > In my syslog I also see a ton of messages like
>> >
>> > Feb 17 03:54:47 anlutest2 lrmd: [1137]: info: perform_op:2877: operations on resource address01 already delayed
>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[419] on ocf::ping::ping:1 for client 1140, its parameters: CRM_meta_clone=[1] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[10000] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[5000]  for rsc is already running.
>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:1 by 1000 ms
>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[171] on ocf::ping::ping:2 for client 1140, its parameters: CRM_meta_clone=[2] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[1] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[5000]  for rsc is already running.
>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:2 by 1000 ms
>> >
>> > and occasionally
>> >
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (4000)
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_ha_callback: flush message from anlutest2
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: WARN: find_nvpair_attr: Multiple attributes match name=pingd
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr: Value: 50 #011(id=status-d619a94e-ebba-4ed0-8e0f-89837dd7506b-pingd)
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr: Value: 3 #011(id=status-ab3c1a25-9471-48f7-9c0b-c76238abd402-pingd)
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_perform_update: Sent update -40: pingd=4000
>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: ERROR: attrd_cib_callback: Update -40 for pingd=4000 failed: Required data for this CIB API call not found
>> > Could someone just take a look at my config and let me know what I'm doing
>> > wrong? Or if there's a better way to do what I want to do...
>> >
>> > Thanks in advance,
>> > Anlu
>>
>> > _______________________________________________
>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >
>> > Project Home: http://www.clusterlabs.org
>> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> > Bugs: http://bugs.clusterlabs.org
>>