[Pacemaker] resource moving unnecessarily due to ping race condition

Wed Sep 14 05:51:26 EDT 2011

On 09/13/2011 10:36 PM, Brad Johnson wrote:
> Yes, the suggested approach has the problem that when both nodes drop to a
> score of zero, the resource cannot run anywhere. I have gone back to my
> original "best connectivity" approach, but now using my own ping RA
> which uses different dampening delay on the active vs. standby node. On
> the active node when the score is rising, and on the standby node when
> the score is falling, a delay of zero is used. The other cases use the
> configured delay. This works much better at keeping our resource from
> failing over when ping hosts are brought down and back up. But the
> problem still happens some of the time.
> There are 2 problems I see:
> 1) the dampening delay is overridden when we receive a flush message
> from the other node - instead we immediately send an update with the
> current value.
> 2) the dampen value should be as large as the product of the OCF RA
> attempts * timeout values, since the nodes are asynchronously pinging
> and may be off as much as an entire interval. BUT pacemaker seems to not
> work properly when the dampen value is larger than the resource interval.
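
The asymmetric dampening rule Brad describes above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not code from his modified RA; the function name, its parameters, and the idea of comparing the old and new score directly are assumptions made for the example.

```python
def dampen_delay(is_active: bool, old_score: int, new_score: int,
                 configured_delay: float) -> float:
    """Return the delay in seconds to apply before publishing new_score.

    Hypothetical helper: the names and signature are illustrative only.
    """
    rising = new_score > old_score
    falling = new_score < old_score
    # Active node with a rising score, or standby node with a falling
    # score: publish immediately so the active node never looks worse
    # than the standby during a transition.
    if (is_active and rising) or (not is_active and falling):
        return 0.0
    # Every other case (including an unchanged score) uses the
    # configured dampening delay.
    return configured_delay
```

With this rule, when a shared ping host goes down the standby node lowers its score at once while the active node waits, and when the host comes back the active node raises its score first, so in both directions the active node should not briefly hold the lower score.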

There have been some fixes in Pacemaker 1.0.11 to make this work
properly ... dampen value is a multiple of monitor interval

Regards,
Andreas
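
The two sizing remarks in this exchange (a dampen value at least as large as attempts * timeout, and a dampen value that is a multiple of the monitor interval) can be combined into one small calculation. The Python sketch below assumes both conditions should hold at the same time, which the thread does not state outright; the function and parameter names are made up for illustration and are not part of any Pacemaker tool.

```python
def suggested_dampen(attempts: int, timeout_s: int,
                     monitor_interval_s: int) -> int:
    """Smallest multiple of the monitor interval that is >= attempts * timeout.

    Hypothetical helper; all values are in whole seconds.
    """
    if attempts <= 0 or timeout_s <= 0 or monitor_interval_s <= 0:
        raise ValueError("all arguments must be positive")
    # Worst-case time one node may need before it notices a dead ping host.
    worst_case = attempts * timeout_s
    # Integer ceiling division, then scale back up to a whole multiple.
    multiples = (worst_case + monitor_interval_s - 1) // monitor_interval_s
    return multiples * monitor_interval_s
```

For example, with 3 attempts, a 5 second timeout, and a 10 second monitor interval, the worst case is 15 seconds and the function returns 20.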

> 
> Any suggestions would be appreciated.
> 
> ...Brad
> 
> On 09/10/2011 11:30 AM, Vadym Chepkov wrote:
>> On Sep 8, 2011, at 3:40 PM, Florian Haas wrote:
>>
>>>>> On 09/08/11 20:59, Brad Johnson wrote:
>>>>>> We have a 2 node cluster with a single resource. The resource must run
>>>>>> on only a single node at one time. Using the ocf:pacemaker:ping RA we
>>>>>> are pinging a WAN gateway and a LAN host on each node so the resource
>>>>>> runs on the node with the greatest connectivity. The problem is when a
>>>>>> ping host goes down (so both nodes lose connectivity to it), the
>>>>>> resource moves to the other node due to timing differences in how fast
>>>>>> they update the score attribute. The dampening value has no effect,
>>>>>> since it delays both nodes by the same amount. These unnecessary
>>>>>> fail-overs aren't acceptable since they are disruptive to the network
>>>>>> for no reason.
>>>>>> Is there a way to dampen the ping update by different amounts on the
>>>>>> active and passive nodes? Or some other way to configure the cluster to
>>>>>> try to keep the resource where it is during these tie score scenarios?
>>> location pingd-constraint group_1 \
>>>   rule $id="pingd-constraint-rule" pingd: defined pingd
>>>
>>> May I suggest that you simply change this constraint to
>>>
>>> location pingd-constraint group_1 \
>>>   rule $id="pingd-constraint-rule" \
>>>     -inf: not_defined pingd or pingd lte 0
>>>
>>> That way, only a host that definitely has _no_ connectivity carries a
>>> -INF score for that resource group. And I believe that is what you
>>> really want, rather than take the actual ping score as a placement
>>> weight (your "best connectivity" approach).
>>>
>>> Just my 2 cents, though.
>>>
>> Even though this approach was recommended many times, there is a
>> problem with it.
>> What if all nodes for some reason are not able to ping?
>> This rule would cause a resource to be brought down completely,
>> whereas if you use the "best connectivity" approach it will stay up
>> where it was before the network failed.
>>
>> Vadym
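
The difference between the two constraints discussed above can be illustrated with a toy model in Python. This is not how Pacemaker's policy engine actually computes placement; it only mirrors the two rules as quoted, and it assumes that a resource stays on its current node whenever that node ties for the best score. All function names are hypothetical.

```python
from typing import Callable, Dict, Optional

NEG_INF = float("-inf")

def score_best_connectivity(pingd: Optional[int]) -> float:
    # Models: rule pingd: defined pingd
    # The attribute value itself is the placement score when it is defined.
    return float(pingd) if pingd is not None else 0.0

def score_ban_unconnected(pingd: Optional[int]) -> float:
    # Models: rule -inf: not_defined pingd or pingd lte 0
    return NEG_INF if pingd is None or pingd <= 0 else 0.0

def place(pingd_by_node: Dict[str, Optional[int]], current: str,
          score_fn: Callable[[Optional[int]], float]) -> Optional[str]:
    """Return the node the resource would run on, or None if it must stop."""
    scores = {node: score_fn(value) for node, value in pingd_by_node.items()}
    eligible = {node: s for node, s in scores.items() if s != NEG_INF}
    if not eligible:
        return None  # every node is banned, so the resource is stopped
    best = max(eligible.values())
    # Toy tie-break assumption: stay put if the current node is among the best.
    if eligible.get(current) == best:
        return current
    return max(eligible, key=eligible.get)
```

When every node reports a score of 0, `place({"node1": 0, "node2": 0}, "node1", score_best_connectivity)` keeps the resource on `node1`, while the same call with `score_ban_unconnected` returns `None`, which is the outage Vadym warns about.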
>>
>>
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs:
>> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker