[Pacemaker] How to really deal with gateway restarts?

Mon Jun 14 06:13:59 UTC 2010

On Thu, Jun 10, 2010 at 9:22 PM, Maros Timko <timkom at gmail.com> wrote:
> Hi all,
>
> I know this has been asked here a number of times, but with no really
> conclusive answer. All of the replies were: update Pacemaker and use
> the ping RA.
>
> Setup:
>  - simple symmetric 2-node DRBD-Xen cluster
>  - both nodes connected to the same network and gateway
>  - cloned ping RA to monitor gateway and update pingd attribute
>  - pingd:defined used to migrate resources to the node with better
> connectivity
>
> Scenario:
>  - simulate gateway failure or restart
>
> Expected outcome:
>  - active node should remain active without touching resources because
> both nodes have the same score (pingd=0) and pingd:defined means "do
> not shut down resources when a node loses connectivity"
>
> Experienced outcome:
>  - CRM initiates resource migration
>  - Xen VM is stopped
>  - CRM aborts resource migration
>  - Xen VM is started
>  - active node is active again, but VM was restarted
>
> Analysis of the problem:
>  - because the currently active node is the DC (though probably not
> only for that reason), the pingd update from the active node is
> processed first, before the update from the standby node. At that
> moment the standby node has the better score, so the CRM decides to
> migrate resources.
>  - the attribute update from the standby node is then processed, which
> rolls the migration back
>
> Possible resolutions:
>  - tweak the standby ping RA to postpone updates a bit (a bit stupid
> and asymmetric)
>  - ensure that standby is DC (no CLI option and not sure if that would
> help though)
>  - ensure that standby monitoring cycle is delayed after active one
> (but how with cloned RA)
>  - any other proposal?
>
> I thought the "dampen" attribute could help with some of the options,
> but actually it does not.

It should do.  Hard to say without any logs from the two machines.
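
To make the ordering problem concrete, here is a toy Python model of the
race described above. It is only a sketch under stated assumptions: the
names Update, preferred and replay are hypothetical, the batching window
is a simplification and not attrd's actual dampening algorithm, and the
0.5 second gap between the two updates is an assumed value. It only
counts placement changes and does not model the aborted transition that
brought the VM back to the active node.

```python
from dataclasses import dataclass


@dataclass
class Update:
    time: float  # seconds at which the node reports its new pingd value
    node: str
    value: int


def preferred(scores, active):
    """Pick the highest-scoring node; on a tie, stay on the active node."""
    best = max(scores.values())
    if scores[active] == best:
        return active
    return max(scores, key=scores.get)


def replay(updates, scores, active, dampen=0.0):
    """Return the placement changes this toy policy would make.

    Updates that arrive within `dampen` seconds of the first update in a
    batch are applied together before placement is re-evaluated.
    This is an assumed simplification, not Pacemaker's implementation.
    """
    scores = dict(scores)  # work on a copy
    moves = []
    queue = sorted(updates, key=lambda u: u.time)
    i = 0
    while i < len(queue):
        window_end = queue[i].time + dampen
        # Apply every update that falls inside the current window.
        while i < len(queue) and queue[i].time <= window_end:
            scores[queue[i].node] = queue[i].value
            i += 1
        target = preferred(scores, active)
        if target != active:
            moves.append((active, target))
            active = target
    return moves


# Gateway goes down: both nodes drop to pingd=0, the active node first.
gateway_down = [Update(0.0, "node1", 0), Update(0.5, "node2", 0)]
start = {"node1": 100, "node2": 100}

undamped = replay(gateway_down, start, active="node1", dampen=0.0)
damped = replay(gateway_down, start, active="node1", dampen=1.0)
print("without dampening:", undamped)
print("with dampening:   ", damped)
```

Without a window the model moves the resource off node1 as soon as
node1's update lands. With a window longer than the gap between the two
updates it sees pingd=0 on both nodes at once and makes no change.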

> The only thing that worked for me was restarting the CRM on the
> standby node until its monitoring cycle ran a bit behind the active
> node's. But I would not be happy with that.
> Does anybody have any idea if there could be some option like "Hey,
> change of this attribute can trigger resource migration. Let's wait a
> while (configured) for standby value update..."? Or any other crazy
> ideas?
>
> Thanks,
> Tino