[Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?

Andrew Beekhof andrew at beekhof.net
Tue Sep 4 18:06:17 EDT 2012


On Tue, Aug 28, 2012 at 6:44 AM, Andrew Martin <amartin at xes-inc.com> wrote:
> Hi Jake,
>
> Thank you for the detailed analysis of this problem. The original reason I
> was utilizing ocf:pacemaker:ping was to ensure that the node with the best
> network connectivity (network connectivity being judged by the ability to
> communicate with 192.168.0.128 and 192.168.0.129) would be the one running
> the resources. However, it is possible that either of these IPs could be
> down for maintenance or due to a hardware failure, and the cluster should not be
> affected by this. It seems that a synchronous ping check from all of the
> nodes would ensure this behavior without this unfortunate side-effect.
>
> Is there another way to achieve the same network connectivity check instead
> of using ocf:pacemaker:ping? I know the other *ping* resource agents are
> deprecated.

With the correct value of dampen, things should behave as expected
regardless of which ping variant is used.
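
For example, something like the configuration quoted below with only dampen
changed - the 60s here is purely illustrative and should be tuned against your
monitor interval and the length of outage you want to ride out:

primitive p_ping ocf:pacemaker:ping \
        params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
               dampen="60s" multiplier="1000" attempts="8" debug="true" \
        op start interval="0" timeout="60" \
        op monitor interval="10s" timeout="60"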

>
> Thanks,
>
> Andrew
>
> ________________________________
> From: "Jake Smith" <jsmith at argotec.com>
> To: "Andrew Martin" <amartin at xes-inc.com>
> Cc: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, August 27, 2012 1:47:25 PM
>
> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>
>
> ----- Original Message -----
>> From: "Andrew Martin" <amartin at xes-inc.com>
>> To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster resource
>> manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, August 27, 2012 1:01:54 PM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>>
>> Jake,
>>
>>
>> Attached is the log from the same period for node2. If I am reading
>> this correctly, it looks like there was a 7 second difference
>> between when node1 set its score to 1000 and when node2 set its
>> score to 1000?
>
> I agree, and (I think) more importantly, this is what caused the issue - to
> the best of my knowledge, not necessarily fact ;-)
>
> At 10:40:43 node1 updates its pingd to 1000, causing the policy engine to
> recalculate node preference.
> At 10:40:44 transition 760 is initiated to move everything to the more
> preferred node2, because its pingd value is 2000.
> At 10:40:50 node2's pingd value drops to 1000.  The policy engine doesn't
> stop/change the in-process transition - node1 and node2 are equal now, but the
> transition is in process and node1 isn't more preferred, so it continues.
> At 10:41:02 ping is back on node1 and ready to update pingd to 2000.
> At 10:41:07, after dampen, node1 updates pingd to 2000, which is greater than
> node2's value.
> At 10:41:08 the cluster recognizes a change in pingd value that requires a
> recalculation of node preference and aborts the in-process transition (760).
> I believe the cluster then waits for all in-process actions to complete so
> the cluster is in a known state to recalculate.
> At 10:42:10, I'm guessing, the shutdown timeout is reached without completing,
> so the VirtualDomain is forcibly shut down.
> Once all of that is done, transition 760 finishes stopping/aborting, with
> some actions completed and some not:
>
> Aug 22 10:42:13 node1 crmd: [4403]: notice: run_graph: Transition 760
> (Complete=20, Pending=0, Fired=0, Skipped=39, Incomplete=30,
> Source=/var/lib/pengine/pe-input-2952.bz2): Stopped
> Then the cluster recalculates the node preference and restarts those
> services that are stopped on node1, because the pingd scores of node1 and
> node2 are equal, so there is a preference to stay on node1 where some services
> are still active (drbd or such, I'm guessing, are still running on node1).
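>
> As an aside, the pe-input file named in the run_graph line above can be
> replayed offline to see exactly what the policy engine decided in transition
> 760. Just a sketch (I believe crm_simulate can read the compressed file
> directly):
>
> crm_simulate -S -x /var/lib/pengine/pe-input-2952.bz2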
>
>
>> Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked:
>> attrd_updater -n p_ping -v 1000 -d 5s
>
> Before this is the ping failure:
>
> Aug 22 10:40:31 node1 ping[1668]: [1823]: WARNING: 192.168.0.128 is
> inactive: PING 192.168.0.128 (192.168.0.128) 56(84) bytes of
> data.#012#012--- 192.168.0.128 ping statistics ---#0128 packets transmitted,
> 0 received, 100% packet loss, time 7055ms
>
> Then you get the 7-second delay to do the 8 attempts, I believe, and then the
> 5-second dampen (-d 5s) brings us to:
>
>> Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update:
>> Sending flush op to all hosts for: p_ping (1000)
>> Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update:
>> Sent update 265: p_ping=1000
>>
>
> Same thing on node2 - fails at 10:40:38 and then 7 seconds later:
>
>> Aug 22 10:40:45 node2 attrd_updater: [27245]: info: Invoked:
>> attrd_updater -n p_ping -v 1000 -d 5s
>
> 5s Dampen
>
>> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_trigger_update:
>> Sending flush op to all hosts for: p_ping (1000)
>> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_perform_update:
>> Sent update 122: p_ping=1000
>>
>> I had changed the attempts value to 8 (from the default 2) to address
>> this same issue - to avoid resource migration based on brief
>> connectivity problems with these IPs - however, if we can get dampen
>> configured correctly, I'll set it back to the default.
>>
>
> Well, after looking through both more closely, I'm not sure dampen is what
> you'll need to fix the deeper problem.  The time between failure and return was
> 10:40:31 to 10:41:02, or 32 seconds (31 on node2).  I believe if you had a
> dampen value greater than the monitor interval plus the time failed, then
> nothing would have happened (dampen > 10 + 32).  However, I'm not sure I
> would call 32 seconds a blip in connection - that's up to you.  And since
> the dampen applies to all of the ping clones equally, a ping failure longer
> than your dampen value would still give you the same problem.  For
> example, assuming a dampen of 45 seconds:
> Node1 fails at 1:01, node2 fails at 1:08.
> Node1 will still update its pingd value at 1:52 - 7 seconds before node2
> will - and the transition will still happen even though both nodes have the
> same connectivity in reality.
>
> I guess what I'm saying in the end is that dampen is there to prevent
> movement for a momentary outage/blip in the pings, with the idea being that
> the pings will return before the dampen expires.  It isn't going to wait out
> the dampen on the other node(s) before making a decision.  You would need to
> be able to add something like a 10s sleep in there AFTER the pingd value is
> updated and BEFORE the node preference scoring is evaluated!
>
> So in the end I don't have a fix for you, except maybe to set dampen in the
> 45-60 second range if outages of around 30 seconds that you want to ride out
> without moving are commonplace in your setup.  However, that would extend the
> time to wait until failover in the case of a complete failure of pings on one
> node only.
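>
> If you do bump dampen, it can be changed on the existing primitive without
> rewriting the whole config - just a sketch, with the 60s value purely
> illustrative:
>
> crm_resource --resource p_ping --set-parameter dampen --parameter-value 60s
>
> (or "crm configure edit p_ping" and adjust the params line by hand)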
>
> :-(
>
> Jake
>
>>
>> Thanks,
>>
>>
>> Andrew
>>
>> ----- Original Message -----
>>
>> From: "Jake Smith" <jsmith at argotec.com>
>> To: "The Pacemaker cluster resource manager"
>> <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, August 27, 2012 9:39:30 AM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
>> resources to restart?
>>
>>
>> ----- Original Message -----
>> > From: "Andrew Martin" <amartin at xes-inc.com>
>> > To: "The Pacemaker cluster resource manager"
>> > <pacemaker at oss.clusterlabs.org>
>> > Sent: Thursday, August 23, 2012 7:36:26 PM
>> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
>> > resources to restart?
>> >
>> > Hi Florian,
>> >
>> >
>> > Thanks for the suggestion. I gave it a try, but even with a dampen
>> > value greater than 2* the monitoring interval the same behavior
>> > occurred (pacemaker restarted the resources on the same node). Here
>> > are my current ocf:pacemaker:ping settings:
>> >
>> > primitive p_ping ocf:pacemaker:ping \
>> > params name="p_ping" host_list="192.168.0.128 192.168.0.129"
>> > dampen="25s" multiplier="1000" attempts="8" debug="true" \
>> > op start interval="0" timeout="60" \
>> > op monitor interval="10s" timeout="60"
>> >
>> >
>> > Any other ideas on what is causing this behavior? My understanding is
>> > the above config tells the cluster to attempt 8 pings to each of the
>> > IPs, and will assume that an IP is down if none of the 8 come back.
>> > Thus, an IP would have to be down for more than 8 seconds to be
>> > considered down. The dampen parameter tells the cluster to wait
>> > before making any decision, so that if the IP comes back online
>> > within the dampen period then no action is taken. Is this correct?
>> >
>> >
>>
>> I'm no expert on this either, but I believe the dampen isn't long
>> enough - I think what you say above is correct, but not only does the
>> IP need to come back online, the cluster must also ping it
>> successfully.  I would suggest trying a dampen greater than
>> 3 * the monitor value.
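>>
>> (With your 10s monitor interval that would mean something like
>> dampen="35s" or more in the p_ping params - just a rough figure.)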
>>
>> I don't think it's a problem but why change the attempts from the
>> default 2 to 8?
>>
>> > Thanks,
>> >
>> >
>> > Andrew
>> >
>> >
>> > ----- Original Message -----
>> >
>> > From: "Florian Crouzat" <gentoo at floriancrouzat.net>
>> > To: pacemaker at oss.clusterlabs.org
>> > Sent: Thursday, August 23, 2012 3:57:02 AM
>> > Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
>> > resources to restart?
>> >
>> > On 22/08/2012 18:23, Andrew Martin wrote:
>> > > Hello,
>> > >
>> > >
>> > > I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and 1
>> > > quorum node that cannot run resources) running on Ubuntu 12.04
>> > > Server amd64. This cluster has a DRBD resource that it mounts and
>> > > then runs a KVM virtual machine from. I have configured the
>> > > cluster to use ocf:pacemaker:ping with two other devices on the
>> > > network (192.168.0.128, 192.168.0.129), and set constraints to
>> > > move the resources to the most well-connected node (whichever
>> > > node
>> > > can see more of these two devices):
>> > >
>> > > primitive p_ping ocf:pacemaker:ping \
>> > > params name="p_ping" host_list="192.168.0.128 192.168.0.129"
>> > > multiplier="1000" attempts="8" debug="true" \
>> > > op start interval="0" timeout="60" \
>> > > op monitor interval="10s" timeout="60"
>> > > ...
>> > >
>> > > clone cl_ping p_ping \
>> > > meta interleave="true"
>> > >
>> > > ...
>> > > location loc_run_on_most_connected g_vm \
>> > > rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping
>> > >
>> > >
>> > > Today, 192.168.0.128's network cable was unplugged for a few
>> > > seconds and then plugged back in. During this time, pacemaker
>> > > recognized that it could not ping 192.168.0.128 and restarted all
>> > > of the resources, but left them on the same node. My understanding
>> > > was that since neither node could ping 192.168.0.128 during this
>> > > period, pacemaker would do nothing with the resources (leave them
>> > > running). It would only migrate or restart the resources if for
>> > > example node2 could ping 192.168.0.128 but node1 could not (move
>> > > the resources to where things are better-connected). Is this
>> > > understanding incorrect? If so, is there a way I can change my
>> > > configuration so that it will only restart/migrate resources if
>> > > one node is found to be better connected?
>> > >
>> > > Can you tell me why these resources were restarted? I have attached
>> > > the syslog as well as my full CIB configuration.
>> > >
>>
>> As was said already, the log shows node1 changed its value for pingd
>> to 1000, waited the 5 seconds of dampening and then started actions
>> to move the resources. In the midst of stopping everything, ping ran
>> again successfully and the value increased back to 2000. This caused
>> the policy engine to recalculate scores for all resources (before
>> they had the chance to start on node2). I'm no scoring expert but I
>> know there is additional value given to keeping resources collocated
>> with their partners that are already running, plus resource
>> stickiness not to move. So in this situation the score to
>> stay/run on node1 once pingd was back at 2000 was greater than the
>> score to move, so the things that were stopped or stopping restarted on
>> node1. So increasing the dampen value should help/fix it.
>>
>> Unfortunately you didn't include the log from node2, so we can't
>> correlate node2's pingd values with node1's at the same times.
>> I believe if you look at the pingd values and the times at which movement
>> started between the nodes, you will be able to make a better guess at
>> how high a dampen value is needed to make sure the nodes had the same
>> pingd value *before* the dampen time ran out, and that should prevent
>> movement.
>>
>> HTH
>>
>> Jake
>>
>> > > Thanks,
>> > >
>> > > Andrew Martin
>> > >
>> >
>> > This is an interesting question and I'm also interested in answers.
>> >
>> > I had the same observations, and there is also the case where the
>> > monitor() operations aren't synced across all nodes, so: "Node1 issues
>> > a monitor() on the ping resource and finds the ping node dead; node2
>> > hasn't pinged yet, so node1 moves things to node2, but node2 now issues
>> > a monitor() and also finds the ping node dead."
>> >
>> > The only solution I found was to adjust the dampen parameter to at
>> > least 2 * the monitor() interval, so that I can be *sure* that all
>> > nodes have issued a monitor() and have all decreased their scores, so
>> > that when a decision occurs, nothing moves.
>> >
>> > It's been a long time since I last tested this; my cluster is very very
>> > stable. I guess I should retry to validate that it's still a working
>> > trick.
>> >
>> > ====
>> >
>> > dampen (integer, [5s]): Dampening interval
>> > The time to wait (dampening) further changes occur
>> >
>> > Eg:
>> >
>> > primitive ping-nq-sw-swsec ocf:pacemaker:ping \
>> > params host_list="192.168.10.1 192.168.2.11 192.168.2.12"
>> > dampen="35s" attempts="2" timeout="2" multiplier="100" \
>> > op monitor interval="15s"
>> >
>> >
>> >
>> >
>> > --
>> > Cheers,
>> > Florian Crouzat
>> >
>> >
>> >
>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



