[Pacemaker] Trouble setting up IP failover with ping resource

Sat Feb 18 04:39:12 CET 2012

I figured it out: it turns out the ping resource agent has some undocumented
parameters. The monitor operation was running with the default timeout of 20
seconds, so the ping process was killed when that time was up, before the
ping had finished.

See:

http://hg.clusterlabs.org/pacemaker/stable-1.0/raw-file/tip/extra/resources/ping
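
To get a feel for why 20 seconds is not enough, here is a rough
back-of-the-envelope sketch in Python. It is only an estimate under some
assumptions of mine: that the agent pings the hosts in host_list one after
another, that ping sends one packet per second, and that an unreachable host
costs up to an extra -W seconds after the last packet. The function name is
made up for illustration. The "-W 3 -c 5" values and the six hosts are the
ones from the log and config quoted below.

```python
# Rough estimate of how long one ping monitor pass takes.
# Assumptions (not taken from the agent source): hosts are pinged one
# after another, and ping sends one packet per `interval` seconds.

HOSTS = [
    "10.54.130.6", "10.54.130.8", "10.54.130.7",
    "50.97.196.101", "50.97.196.103", "50.97.196.102",
]


def estimate_monitor_seconds(hosts, attempts, wait, interval=1.0, reachable=True):
    """Return an approximate duration, in seconds, of one monitor pass.

    attempts -- value passed to `ping -c`
    wait     -- value passed to `ping -W`
    """
    # Sending `attempts` packets takes (attempts - 1) intervals.
    per_host = (attempts - 1) * interval
    if not reachable:
        # With no replies, ping waits up to `wait` seconds after the last packet.
        per_host += wait
    return len(hosts) * per_host


if __name__ == "__main__":
    best = estimate_monitor_seconds(HOSTS, attempts=5, wait=3)
    worst = estimate_monitor_seconds(HOSTS, attempts=5, wait=3, reachable=False)
    print(f"all hosts reachable:   about {best:.0f} s")
    print(f"all hosts unreachable: about {worst:.0f} s")
```

Under these assumptions a single pass over six hosts comes out at roughly 24
seconds even when every host answers, which is already longer than a
20 second timeout.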

Anlu

On Fri, Feb 17, 2012 at 11:09 AM, Anlu Wang <anlu at mixpanel.com> wrote:

> Also, in my constraints section, for the ping connectivity resource
> location definitions, a node attribute is not specified on rsc_location.
> What is the default value of node then?
>
> Anlu
>
>
> On Fri, Feb 17, 2012 at 10:57 AM, Anlu Wang <anlu at mixpanel.com> wrote:
>
>> I'm running 1.0.8. In accordance with the bug in the post you linked, I
>> changed the config so that interval is greater than dampen. Here is the
>> relevant section now:
>>
>>       <clone id="connectivity_resource">
>>         <primitive class="ocf" id="ping" provider="pacemaker" type="ping">
>>           <instance_attributes id="ping-attrs">
>>             <nvpair id="pingd-dampen" name="dampen" value="5s"/>
>>             <nvpair id="pingd-multiplier" name="multiplier" value="100"/>
>>             <nvpair id="pingd-hosts" name="host_list" value="10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.97.196.102"/>
>>           </instance_attributes>
>>           <operations>
>>             <op id="ping-monitor-10s" interval="10s" name="monitor" timeout="60s"/>
>>           </operations>
>>         </primitive>
>>         <meta_attributes id="connectivity_resource-meta_attributes">
>>           <nvpair id="connectivity_resource-meta_attributes-target-role" name="target-role" value="Started"/>
>>         </meta_attributes>
>>       </clone>
>>
>> The scores are still not what I expect however, and when I disable the
>> internal interface on a node, nothing happens with failover.
>>
>> Also, I've noticed this in my syslog:
>>
>> Feb 17 06:26:11 anlutest2 lrmd: [1137]: WARN: ping:1:monitor process (PID 9380) timed out (try 1).  Killing with signal SIGTERM (15).
>> Feb 17 06:26:11 anlutest2 lrmd: [1137]: info: RA output: (ping:1:monitor:stderr) Terminated
>> Feb 17 06:26:11 anlutest2 ping[9380]: [15745]: INFO: They use TERM to bring us down. No such luck.
>> Feb 17 06:26:11 anlutest2 ping[9380]: [15747]: ERROR: Unexpected result for 'ping -n -q -W 3 -c 5  50.97.196.103' 143:
>>
>> So it looks like the ping command is failing for some reason, but when I
>> run it manually, it succeeds...
>>
>> Really at a loss here, any help is appreciated!
>>
>> Anlu
>>
>> On Fri, Feb 17, 2012 at 3:26 AM, Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
>>
>>> Hi,
>>>
>>> On Thu, Feb 16, 2012 at 07:57:14PM -0800, Anlu Wang wrote:
>>> > I have three machines named anlutest1, anlutest2, and anlutest3 that I'm
>>> > trying to get IP failover working on. I'm using heartbeat for the messaging
>>> > layer, and everything works great when a machine goes down. But I also
>>> > would like to fail over an IP when EITHER the eth0 or eth1 network
>>> > interface fails. From reading
>>> >
>>> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/ch09s03s03.html
>>> >
>>> > it seems the right way to do this is to add a ping resource.
>>> >
>>> > Here is my XML configuration:
>>> >
>>> > http://pastebin.com/05z7eB2s
>>>
>>> The configuration seems OK, though obviously monitors are
>>> scheduled back-to-back (the postponed operations messages below).
>>> I guess that you should increase the intervals or reduce the
>>> dampen period. Which version of Pacemaker do you run? Perhaps
>>> also take a look at this thread:
>>>
>>> http://oss.clusterlabs.org/pipermail/pacemaker/2011-April/009942.html
>>>
>>> Thanks,
>>>
>>> Dejan
>>>
>>> > This config doesn't work for me. Using the showscores.sh script found at:
>>> >
>>> > http://www.mail-archive.com/pacemaker@oss.clusterlabs.org/msg00410.html
>>> >
>>> > I see that my scores are:
>>> >
>>> > Resource                       Score     Node      Stickiness #Fail  Migration-Threshold
>>> > address01                      0         anlutest3 0          0
>>> > address01                      1006      anlutest1 0          5
>>> > address01                      50        anlutest2 0          157
>>> > address02                      0         anlutest3 0          0
>>> > address02                      1050      anlutest2 0          2
>>> > address02                      6         anlutest1 0          0
>>> > address03                      1000      anlutest3 0          7
>>> > address03                      50        anlutest2 0
>>> > address03                      6         anlutest1 0          0
>>> > ping:0                         0         anlutest1 0          6
>>> > ping:0                         0         anlutest2 0          14
>>> > ping:0                         0         anlutest3 0          0
>>> > ping:1                         0         anlutest2 0
>>> > ping:1                         0         anlutest3 0          28
>>> > ping:1                         -1000000  anlutest1 0          0
>>> > ping:2                         0         anlutest3 0          13
>>> > ping:2                         -1000000  anlutest1 0          0
>>> > ping:2                         -1000000  anlutest2 0
>>> >
>>> > which make no sense at all. I don't see how I could be getting these scores
>>> > of 50 and 1006. When I take down an interface on anlutest3, I see scores of
>>> > 4 and 1004, which sort of make sense, just the multiplier of 100 isn't
>>> > working. I was experimenting with changing values, so maybe it's caching old
>>> > values. If so, how do I enforce the new values?
>>> >
>>> > Furthermore, shouldn't there be no scores of 0? If all 6 IPs I am pinging
>>> > return successfully, shouldn't my scores be either 600 or 1600?
>>> >
>>> > In my syslog I also see a ton of messages like
>>> >
>>> > Feb 17 03:54:47 anlutest2 lrmd: [1137]: info: perform_op:2877: operations on resource address01 already delayed
>>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[419] on ocf::ping::ping:1 for client 1140, its parameters: CRM_meta_clone=[1] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[10000] CRM_meta_name=[monitor] CRM_meta_timeout=[60000] CRM_meta_interval=[5000]  for rsc is already running.
>>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:1 by 1000 ms
>>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2873: operation monitor[171] on ocf::ping::ping:2 for client 1140, its parameters: CRM_meta_clone=[2] host_list=[10.54.130.6 10.54.130.8 10.54.130.7 50.97.196.101 50.97.196.103 50.9CRM_meta_clone_max=[3] dampen=[60s] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] multiplier=[1] CRM_meta_name=[monitor] CRM_meta_timeout=[30000] CRM_meta_interval=[5000]  for rsc is already running.
>>> > Feb 17 03:54:48 anlutest2 lrmd: [1137]: info: perform_op:2883: postponing all ops on resource ping:2 by 1000 ms
>>> >
>>> > and occasionally
>>> >
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_trigger_update: Sending flush op to all hosts for: pingd (4000)
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_ha_callback: flush message from anlutest2
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: WARN: find_nvpair_attr: Multiple attributes match name=pingd
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr: Value: 50 #011(id=status-d619a94e-ebba-4ed0-8e0f-89837dd7506b-pingd)
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: find_nvpair_attr: Value: 3 #011(id=status-ab3c1a25-9471-48f7-9c0b-c76238abd402-pingd)
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: info: attrd_perform_update: Sent update -40: pingd=4000
>>> > Feb 17 03:54:33 anlutest2 attrd: [1139]: ERROR: attrd_cib_callback: Update -40 for pingd=4000 failed: Required data for this CIB API call not found
>>> >
>>> > Could someone just take a look at my config and let me know what I'm doing
>>> > wrong? Or if there's a better way to do what I want to do...
>>> >
>>> > Thanks in advance,
>>> > Anlu
>>>
>>> > _______________________________________________
>>> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> >
>>> > Project Home: http://www.clusterlabs.org
>>> > Getting started:
>>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> > Bugs: http://bugs.clusterlabs.org