[Pacemaker] How to really deal with gateway restarts?

Thu Jun 10 15:22:22 EDT 2010

Hi all,

I know this has been asked here a number of times, but with no really
conclusive answer. All of the replies were: update Pacemaker and use
the ping RA.

Setup:
 - simple symmetric 2-node DRBD-Xen cluster
 - both nodes connected to the same network and gateway
 - cloned ping RA to monitor gateway and update pingd attribute
 - pingd:defined used to migrate resources to the node with better
connectivity

Scenario:
 - simulate gateway failure or restart

Expected outcome:
 - the active node should remain active without touching resources,
because both nodes have the same score (pingd=0) and pingd:defined
means "do not shut down resources when a node loses connectivity"

Experienced outcome:
 - CRM initiates resource migration
 - Xen VM is stopped
 - CRM aborts resource migration
 - Xen VM is started
 - active node is active again, but VM was restarted

Analysis of the problem:
 - because the currently active node is the DC (but probably not only
for this reason), the pingd update from the active node is processed
first. At that point the update from the standby has not been processed
yet, so the standby appears to have the better score. Thus the CRM
decides to migrate resources.
 - then the attribute update from the standby node is processed, the
scores are equal again, and the migration is rolled back
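
To illustrate what I think is happening, here is a toy model of the race
in Python. It is not Pacemaker code: the node names, the score of 100
for a reachable gateway, and the "highest pingd wins, ties stay put"
rule are my own simplifying assumptions.

```python
# Toy model of the race described above, NOT Pacemaker code.
# Node names, the score of 100 and the placement rule are assumptions.

def preferred_node(pingd, running_on):
    """Pick the node with the highest pingd; on a tie, stay where we are."""
    best = max(pingd.values())
    if pingd[running_on] == best:
        return running_on
    return max(pingd, key=pingd.get)

def replay(updates, pingd, running_on):
    """Apply attribute updates one by one and record the placement
    decision after each. running_on stays fixed because, in the observed
    scenario, the migration is aborted before it completes."""
    targets = []
    for node, value in updates:
        pingd[node] = value
        targets.append(preferred_node(pingd, running_on))
    return targets

# Both nodes reach the gateway; the VM runs on "active".
scores = {"active": 100, "standby": 100}
# Gateway goes down; the update from the active node (the DC) lands first.
history = replay([("active", 0), ("standby", 0)], scores, "active")
print(history)  # ['standby', 'active']
```

The first decision points at the standby (migration starts, VM is
stopped), the second points back at the active node (migration aborted,
VM started again). If the two updates arrive in the opposite order, the
model never leaves the active node.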

Possible resolutions:
 - tweak the standby ping RA to postpone updates a bit (a bit stupid
and asymmetric)
 - ensure that the standby is the DC (no CLI option, and not sure if
that would help anyway)
 - ensure that the standby monitoring cycle is delayed after the active
one (but how, with a cloned RA?)
 - any other proposal?

I thought the "dampen" parameter could help with some of these options,
but actually it does not. The only thing that worked for me was
restarting the standby CRM until its monitoring cycle was a bit behind
the active node's. But I would not be happy with that.
Does anybody have any idea if there could be some option like "Hey, a
change of this attribute can trigger resource migration. Let's wait a
(configurable) while for the standby value update..."? Or any other
crazy ideas?
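
To make the "wait a while" idea concrete, here is the same kind of toy
model with a settle window. Again this is only a sketch of the behaviour
I am asking for, not an existing Pacemaker option: the `settle` value,
the timestamps and the placement rule are all hypothetical.

```python
# Toy model of a hypothetical "settle window", NOT an existing
# Pacemaker option. Names, scores and timings are assumptions.

def preferred_node(pingd, running_on):
    """Pick the node with the highest pingd; on a tie, stay where we are."""
    best = max(pingd.values())
    if pingd[running_on] == best:
        return running_on
    return max(pingd, key=pingd.get)

def replay_with_settle(updates, pingd, running_on, settle):
    """updates is a list of (time, node, value) sorted by time.
    Placement is re-evaluated only when no further update follows
    within `settle` seconds, so closely spaced updates are seen together."""
    targets = []
    for i, (t, node, value) in enumerate(updates):
        pingd[node] = value
        is_last = i + 1 == len(updates)
        if is_last or updates[i + 1][0] - t > settle:
            targets.append(preferred_node(pingd, running_on))
    return targets

# Gateway fails; the standby's update arrives 0.4 s after the active one.
scores = {"active": 100, "standby": 100}
updates = [(0.0, "active", 0), (0.4, "standby", 0)]
decisions = replay_with_settle(updates, scores, "active", settle=1.0)
print(decisions)  # ['active']
```

With a window longer than the gap between the two updates, only one
decision is made and it keeps the VM where it is. With a window shorter
than the gap, the model shows the same stop/start sequence as before.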

Thanks,
Tino