[Pacemaker] resource moving unnecessarily due to ping race condition
Brad Johnson
bjohnson at ecessa.com
Tue Sep 13 20:36:20 UTC 2011
Yes, the suggested approach has a problem: when both nodes drop to a
score of zero, the resource cannot run anywhere. I have gone back to my
original "best connectivity" approach, but now using my own ping RA,
which uses a different dampening delay on the active vs. standby node. On
the active node when the score is rising, and on the standby node when
the score is falling, a delay of zero is used. The other cases use the
configured delay. This works much better at keeping our resource from
failing over when ping hosts are brought down and back up, but the
problem still happens some of the time.
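
The delay selection described above could be sketched roughly as follows. This is only an illustration, not the actual modified RA: the function name choose_dampen and its arguments are hypothetical, and scores and delays are assumed to be plain integers. The result would be passed as the delay when the agent updates the node attribute (the stock agent does this through attrd_updater with a delay option).

```shell
#!/bin/sh
# Hypothetical helper, not part of the stock ping RA.
# Picks the dampening delay for one attribute update.
#   $1  role: "active" if the resource runs on this node, else "standby"
#   $2  previous ping score
#   $3  new ping score
#   $4  configured dampen delay in seconds
choose_dampen() {
    cd_role=$1
    cd_old=$2
    cd_new=$3
    cd_configured=$4

    if [ "$cd_role" = "active" ] && [ "$cd_new" -gt "$cd_old" ]; then
        # Active node, score rising: publish the better score immediately.
        echo 0
    elif [ "$cd_role" = "standby" ] && [ "$cd_new" -lt "$cd_old" ]; then
        # Standby node, score falling: publish the worse score immediately.
        echo 0
    else
        # All other cases (including an unchanged score) wait the configured delay.
        echo "$cd_configured"
    fi
}
```

The intent is that any change favoring the current placement is reported without delay, while any change that would argue for a move is dampened.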
There are 2 problems I see:
1) The dampening delay is overridden when we receive a flush message
from the other node; instead we immediately send an update with the
current value.
2) The dampen value should be at least as large as the product of the OCF RA
attempts * timeout values, since the nodes are pinging asynchronously
and may be off by as much as an entire interval. But Pacemaker seems not
to work properly when the dampen value is larger than the resource interval.
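
The arithmetic in point 2 can be checked with a small sketch. The helper names and the numbers in the comments are hypothetical, chosen only to show how the two constraints can conflict; they are not taken from the configuration discussed here.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# min_dampen ATTEMPTS TIMEOUT
# Prints the smallest dampen value (seconds) that covers a full worst-case
# ping check, i.e. attempts * timeout.
min_dampen() {
    echo $(( $1 * $2 ))
}

# dampen_exceeds_interval DAMPEN INTERVAL
# Succeeds (exit status 0) when the dampen value is larger than the
# resource interval, which is the case reported above as not working well.
dampen_exceeds_interval() {
    [ "$1" -gt "$2" ]
}

# Example with assumed values: attempts=3 and timeout=4 give a minimum
# dampen of 12 seconds, which is already larger than a 10 second interval.
```

With those assumed values, `min_dampen 3 4` gives 12 and `dampen_exceeds_interval 12 10` succeeds, so the two requirements cannot both be met without changing attempts, timeout, or the interval.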
Any suggestions would be appreciated.
...Brad
On 09/10/2011 11:30 AM, Vadym Chepkov wrote:
> On Sep 8, 2011, at 3:40 PM, Florian Haas wrote:
>
>>>> On 09/08/11 20:59, Brad Johnson wrote:
>>>>> We have a 2 node cluster with a single resource. The resource must run
>>>>> on only a single node at one time. Using the ocf:pacemaker:ping RA we
>>>>> are pinging a WAN gateway and a LAN host on each node so the resource
>>>>> runs on the node with the greatest connectivity. The problem is when a
>>>>> ping host goes down (so both nodes lose connectivity to it), the
>>>>> resource moves to the other node due to timing differences in how fast
>>>>> they update the score attribute. The dampening value has no effect,
>>>>> since it delays both nodes by the same amount. These unnecessary
>>>>> fail-overs aren't acceptable since they are disruptive to the network
>>>>> for no reason.
>>>>> Is there a way to dampen the ping update by different amounts on the
>>>>> active and passive nodes? Or some other way to configure the cluster to
>>>>> try to keep the resource where it is during these tie score scenarios?
>> location pingd-constraint group_1 \
>>     rule $id="pingd-constraint-rule" pingd: defined pingd
>>
>> May I suggest that you simply change this constraint to
>>
>> location pingd-constraint group_1 \
>>     rule $id="pingd-constraint-rule" \
>>     -inf: not_defined pingd or pingd lte 0
>>
>> That way, only a host that definitely has _no_ connectivity carries a
>> -INF score for that resource group. And I believe that is what you
>> really want, rather than take the actual ping score as a placement
>> weight (your "best connectivity" approach).
>>
>> Just my 2 cents, though.
>>
> Even though this approach was recommended many times, there is a problem with it.
> What if all nodes for some reason are not able to ping ?
> This rule would cause a resource to be brought down completely, whereas if you use "best connectivity" approach it will stay up where it was before network failed.
>
> Vadym