No subject

Sun Aug 12 13:32:27 UTC 2012

> Aug 22 10:40:38 node1 attrd_updater: [1860]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s
> Aug 22 10:40:43 node1 attrd: [4402]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
> Aug 22 10:40:44 node1 attrd: [4402]: notice: attrd_perform_update: Sent update 265: p_ping=1000
>
> Aug 22 10:40:45 node2 attrd_updater: [27245]: info: Invoked: attrd_updater -n p_ping -v 1000 -d 5s
> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_trigger_update: Sending flush op to all hosts for: p_ping (1000)
> Aug 22 10:40:50 node2 attrd: [4069]: notice: attrd_perform_update: Sent update 122: p_ping=1000
>
> I had changed the attempts value to 8 (from the default 2) to address this
> same issue - to avoid resource migration based on brief connectivity
> problems with these IPs - however if we can get dampen configured correctly
> I'll set it back to the default.
>
>
> Thanks,
>
>
> Andrew
>
> ----- Original Message -----
>
> From: "Jake Smith" <jsmith at argotec.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, August 27, 2012 9:39:30 AM
> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>
>
> ----- Original Message -----
>> From: "Andrew Martin" <amartin at xes-inc.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Thursday, August 23, 2012 7:36:26 PM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces resources to restart?
>>
>> Hi Florian,
>>
>>
>> Thanks for the suggestion. I gave it a try, but even with a dampen
>> value greater than 2* the monitoring interval the same behavior
>> occurred (pacemaker restarted the resources on the same node). Here
>> are my current ocf:pacemaker:ping settings:
>>
>> primitive p_ping ocf:pacemaker:ping \
>>     params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
>>         dampen="25s" multiplier="1000" attempts="8" debug="true" \
>>     op start interval="0" timeout="60" \
>>     op monitor interval="10s" timeout="60"
>>
>>
>> Any other ideas on what is causing this behavior? My understanding is
>> the above config tells the cluster to attempt 8 pings to each of the
>> IPs, and will assume that an IP is down if none of the 8 come back.
>> Thus, an IP would have to be down for more than 8 seconds to be
>> considered down. The dampen parameter tells the cluster to wait
>> before making any decision, so that if the IP comes back online
>> within the dampen period then no action is taken. Is this correct?
>>
>>
>
> I'm no expert on this either, but I believe the dampen isn't long enough. I
> think what you say above is correct, but not only does the IP need to come
> back online, the cluster must also ping it successfully. I would suggest
> trying a dampen value greater than 3 * the monitor interval.
>
> I don't think it's a problem, but why change the attempts from the default
> 2 to 8?
>
>> Thanks,
>>
>>
>> Andrew
>>
>>
>> ----- Original Message -----
>>
>> From: "Florian Crouzat" <gentoo at floriancrouzat.net>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Thursday, August 23, 2012 3:57:02 AM
>> Subject: Re: [Pacemaker] Loss of ocf:pacemaker:ping target forces
>> resources to restart?
>>
>> On 22/08/2012 18:23, Andrew Martin wrote:
>> > Hello,
>> >
>> >
>> > I have a 3 node Pacemaker + Heartbeat cluster (two real nodes and 1
>> > quorum node that cannot run resources) running on Ubuntu 12.04
>> > Server amd64. This cluster has a DRBD resource that it mounts and
>> > then runs a KVM virtual machine from. I have configured the
>> > cluster to use ocf:pacemaker:ping with two other devices on the
>> > network (192.168.0.128, 192.168.0.129), and set constraints to
>> > move the resources to the most well-connected node (whichever node
>> > can see more of these two devices):
>> >
>> > primitive p_ping ocf:pacemaker:ping \
>> >     params name="p_ping" host_list="192.168.0.128 192.168.0.129" \
>> >         multiplier="1000" attempts="8" debug="true" \
>> >     op start interval="0" timeout="60" \
>> >     op monitor interval="10s" timeout="60"
>> > ...
>> >
>> > clone cl_ping p_ping \
>> >     meta interleave="true"
>> >
>> > ...
>> > location loc_run_on_most_connected g_vm \
>> >     rule $id="loc_run_on_most_connected-rule" p_ping: defined p_ping
>> >
>> >
>> > Today, 192.168.0.128's network cable was unplugged for a few
>> > seconds and then plugged back in. During this time, pacemaker
>> > recognized that it could not ping 192.168.0.128 and restarted all
>> > of the resources, but left them on the same node. My understanding
>> > was that since neither node could ping 192.168.0.128 during this
>> > period, pacemaker would do nothing with the resources (leave them
>> > running). It would only migrate or restart the resources if for
>> > example node2 could ping 192.168.0.128 but node1 could not (move
>> > the resources to where things are better-connected). Is this
>> > understanding incorrect? If so, is there a way I can change my
>> > configuration so that it will only restart/migrate resources if
>> > one node is found to be better connected?
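The arithmetic behind the location rule above can be sketched in a few lines of Python. This is only an illustration. It assumes the ping agent publishes multiplier times the number of reachable hosts as the p_ping node attribute, and that the rule uses that value directly as the score. ping_attribute is a made-up helper, not part of Pacemaker.

```python
# Hypothetical helper: models the p_ping node attribute as
# multiplier * (number of hosts from host_list that answered).
def ping_attribute(reachable_hosts, multiplier=1000):
    return multiplier * len(reachable_hosts)

HOST_LIST = ["192.168.0.128", "192.168.0.129"]

# Both hosts reachable: the value every node publishes in normal operation.
healthy = ping_attribute(HOST_LIST)            # 2000

# 192.168.0.128 unplugged: only one host answers.
degraded = ping_attribute(["192.168.0.129"])   # 1000

# If both nodes have published the degraded value, the rule scores them
# equally and gives no reason to prefer one node over the other.
print("both degraded, difference:", degraded - degraded)

# If only one node has published the degraded value so far, the other
# node looks better connected by one multiplier step.
print("one degraded, difference:", healthy - degraded)
```

On this score alone, resources should only be pulled away while the published values differ between nodes, which is why the timing of the updates matters.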
>> >
>> > Can you tell me why these resources were restarted? I have attached
>> > the syslog as well as my full CIB configuration.
>> >
>
> As was said already, the log shows node1 changed its value for pingd to
> 1000, waited the 5 seconds of dampening and then started actions to move
> the resources. In the midst of stopping everything, ping ran again
> successfully and the value increased back to 2000. This caused the policy
> engine to recalculate scores for all resources (before they had the chance
> to start on node2). I'm no scoring expert, but I know additional score is
> given for keeping resources with colocated partners that are already
> running, and resource stickiness also counts against moving. So in this
> situation the score to stay/run on node1 once pingd was back at 2000 was
> greater than the score to move, so things that were stopped or stopping
> restarted on node1. So increasing the dampen value should help/fix.
>
> Unfortunately you didn't include the log from node2, so we can't correlate
> node2's pingd values with node1's at the same times. I believe if you look
> at the pingd values and the times at which movement is started between the
> nodes, you will be able to make a better guess at how high a dampen value
> would make sure the nodes had the same pingd value *before* the dampen time
> ran out, and that should prevent movement.
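To make the timing argument concrete, here is a rough Python simulation. It is a simplified model built on assumptions, not attrd's actual implementation. It assumes each changed value starts (or restarts) a dampen timer on the node that saw it, and that when any timer expires every node writes its current value, as the "Sending flush op to all hosts" log line suggests. It also assumes a monitor sees the host as down exactly while the outage lasts. The 7-second offset between the nodes matches the 10:40:38 and 10:40:45 invocations in the log. The class and function names are made up.

```python
FULL, DEGRADED = 2000, 1000  # multiplier 1000 * reachable hosts (2 or 1)

class Node:
    def __init__(self, offset):
        self.offset = offset     # second at which the first monitor runs
        self.current = FULL      # latest value seen by the monitor
        self.published = FULL    # value last written to the CIB
        self.deadline = None     # when this node's dampen timer fires

def disagreement_windows(offsets, interval, dampen, outage, horizon=120):
    """Return (start, end) spans during which published values differ."""
    down_from, down_until = outage
    nodes = [Node(offset) for offset in offsets]
    windows, open_since = [], None
    for t in range(horizon + 1):
        # Assumed flush behaviour: one expired timer makes every node publish.
        if any(n.deadline is not None and n.deadline <= t for n in nodes):
            for n in nodes:
                n.published, n.deadline = n.current, None
        # Run the monitors that are due at this second.
        for n in nodes:
            if t >= n.offset and (t - n.offset) % interval == 0:
                seen = DEGRADED if down_from <= t < down_until else FULL
                if seen != n.current:
                    n.current, n.deadline = seen, t + dampen
        differs = len({n.published for n in nodes}) > 1
        if differs and open_since is None:
            open_since = t
        elif not differs and open_since is not None:
            windows.append((open_since, t))
            open_since = None
    if open_since is not None:
        windows.append((open_since, None))
    return windows

# Cable unplugged for 5 seconds (t=38..43); monitors every 10s, 7s apart.
brief_5s = disagreement_windows((0, 7), 10, dampen=5, outage=(38, 43))
brief_25s = disagreement_windows((0, 7), 10, dampen=25, outage=(38, 43))
print("dampen=5s :", brief_5s)
print("dampen=25s:", brief_25s)
```

Under these assumptions, the 5s dampen leaves a window in which node1 has published 1000 while node2 still shows 2000, and the longer dampen leaves none. The real daemon may behave differently in detail, so treat this as a way to reason about the logs rather than a prediction.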
>
> HTH
>
> Jake
>
>> > Thanks,
>> >
>> > Andrew Martin
>> >
>>
>> This is an interesting question and I'm also interested in answers.
>>
>> I had the same observations, and there is also the case where the
>> monitor() calls aren't synced across all nodes, so: "node1 issues a
>> monitor() on the ping resource and finds ping-node dead; node2 hasn't
>> pinged yet, so node1 moves things to node2, but node2 now issues a
>> monitor() and also finds ping-node dead."
>>
>> The only solution I found was to adjust the dampen parameter to at least
>> 2*monitor().interval so that I can be *sure* that all nodes have issued a
>> monitor() and have all decreased their scores, so that when a decision
>> occurs, nothing moves.
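As a quick check of that rule of thumb against the settings quoted in this thread, the ratio of dampen to monitor interval can be computed as below. This is only a sketch. The helper names are made up, the 2x threshold is the rule of thumb from this thread rather than anything Pacemaker enforces, and the "default dampen" row assumes the 5s default from the agent's parameter description.

```python
def seconds(value):
    """Parse durations such as "25s" or "60" (bare numbers taken as seconds)."""
    return int(value[:-1]) if value.endswith("s") else int(value)

def dampen_ratio(dampen, monitor_interval):
    """How many monitor intervals fit into the dampen period."""
    return seconds(dampen) / seconds(monitor_interval)

# (dampen, monitor interval) pairs taken from the configurations in this thread.
configs = {
    "p_ping, default dampen": ("5s", "10s"),
    "p_ping, dampen=25s": ("25s", "10s"),
    "ping-nq-sw-swsec": ("35s", "15s"),
}

for name, (dampen, interval) in configs.items():
    ratio = dampen_ratio(dampen, interval)
    verdict = "meets" if ratio >= 2 else "is below"
    print(f"{name}: dampen is {ratio:.2f}x the interval, {verdict} the 2x rule")
```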
>>
>> It's been a long time since I last tested this; my cluster is very, very
>> stable. I guess I should retry to validate that it's still a working trick.
>>
>> ====
>>
>> dampen (integer, [5s]): Dampening interval
>> The time to wait (dampening) further changes occur
>>
>> Eg:
>>
>> primitive ping-nq-sw-swsec ocf:pacemaker:ping \
>>     params host_list="192.168.10.1 192.168.2.11 192.168.2.12" \
>>         dampen="35s" attempts="2" timeout="2" multiplier="100" \
>>     op monitor interval="15s"
>>
>>
>>
>>
>> --
>> Cheers,
>> Florian Crouzat
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started:
>> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org