[Pacemaker] migration-threshold causing unnecessary restart of underlying resources

Cnut Jansen <work at cnutjansen.eu>
Fri Sep 24 13:23:59 EDT 2010


On 12.08.2010 04:12, Cnut Jansen wrote:

> Basically I have a cluster of 2 nodes with cloned DLM, O2CB, DRBD and
> mount resources, and a MySQL resource (grouped with an IPaddr-resource)
> running on top of the other ones.
> The MySQL (group) resource depends on the mount resource, which in turn
> depends equally on both the DRBD and the O2CB resources, and the
> O2CB resource depends on the DLM resource:
> cloneDlm -> cloneO2cb -\
>                         }-> cloneMountMysql -> mysql / grpMysql( mysql -> ipMysql )
> msDrbdMysql -----------/
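A dependency chain like the one in the diagram above would typically be
expressed with crm shell constraints roughly as follows; the constraint ids
below are made up and the primitive/clone/ms definitions themselves are
omitted, so this is only a sketch of the shape of the configuration:

  # assumed constraint ids; the real configuration may differ
  order ordDlmO2cb    inf: cloneDlm cloneO2cb
  order ordO2cbMount  inf: cloneO2cb cloneMountMysql
  order ordDrbdMount  inf: msDrbdMysql:promote cloneMountMysql:start
  order ordMountMysql inf: cloneMountMysql grpMysql
  colocation colO2cbWithDlm    inf: cloneO2cb cloneDlm
  colocation colMountWithO2cb  inf: cloneMountMysql cloneO2cb
  colocation colMountWithDrbd  inf: cloneMountMysql msDrbdMysql:Master
  colocation colMysqlWithMount inf: grpMysql cloneMountMysql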
> Furthermore, for the MySQL (group) resource I set the meta-attributes
> "migration-threshold=1" and "failure-timeout=90" (later I also tried
> "3" and "130" for these).
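In crm shell syntax, setting those meta-attributes would look roughly like
this (a minimal sketch; whether they sit on the group or on the mysql
primitive is an assumption, not taken from the original configuration):

  group grpMysql mysql ipMysql \
          meta migration-threshold="1" failure-timeout="90"

Time values without a unit are interpreted as seconds, so "90" and "130"
correspond to the failure-timeout values tried above.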

> Now, through a lot of testing, I found out that:
> a) the stops/restarts of the underlying resources happen only when the
> fail-counter hits the limit set by migration-threshold; i.e. when it is
> set to 3, the first 2 failures only restart mysql/grpMysql on the same
> node, and only on the 3rd one are the underlying resources left in a mess
> (while mysql/grpMysql migrates) (reproducible for DRBD; unsure about the
> DLM/O2CB side, but there is sometimes serious trouble there too after
> picking on mysql; I just couldn't definitively link it yet)
> b) upon causing mysql/grpMysql's migration, the score for
> msDrbdMysql:promote changes from 10020 to -inf and stays there for the
> duration of mysql/grpMysql's failure-timeout (verified by also setting it
> to 130), before it rises back up to 10000
> c) msDrbdMysql remains slave until the next cluster-recheck after its
> promote score has gone back up to 10000
> d) I also have the impression that fail-counters don't get reset after
> their failure-timeout, because with migration-threshold=3 set, these
> issues occur on every(!) subsequent picking-on, even when I've waited
> nearly 5 minutes (with failure-timeout=90) without touching the cluster
> at all
> 
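A few standard commands can help to verify observations a) to d); "node1"
is just a placeholder for one of the two cluster nodes:

  crm_mon -1 -f                            # one-shot status including fail counts
  crm resource failcount mysql show node1  # query mysql's fail count on node1
  crm resource cleanup mysql               # clear failures / reset the fail count
  ptest -sL                                # show allocation scores (Pacemaker 1.0)
  crm_simulate -sL                         # same on 1.1, e.g. to watch msDrbdMysql's promote score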
> I experienced this on both test clusters, a SLES 11 HAE SP1 with
> Pacemaker 1.1.2 and a Debian Squeeze with Pacemaker 1.0.9. When the
> migration-threshold for mysql/grpMysql is removed, everything is fine
> (except that, of course, there is no migration). I can't remember
> anything like this happening with SLES 11 HAE SP0's Pacemaker 1.0.6.

> P.S.: Just for fun / testing / proof, I also constrained grpLdirector to
> cloneMountShared... and could perfectly reproduce that problem with its
> then-underlying resources too.
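For the record, that cross-check presumably boils down to constraints of
this shape (the ids are again made up):

  order ordMountSharedLdirector inf: cloneMountShared grpLdirector
  colocation colLdirectorWithMountShared inf: grpLdirector cloneMountShared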

For reference:
SLES11-HAE-SP1: The issues seem to be solved with the latest officially
released packages (upgraded yesterday directly from Novell's repositories),
including Pacemaker version 1.1.2-0.6.1 (arch: x86_64), shown in crm_mon
as "1.1.2-ecb1e2ea172ba2551f0bd763e557fccde68c849b". At least so far I
couldn't reproduce any unnecessary restart of the underlying resources
(nor any other meddling with them at all), and fail-counters now get reset
- after the failure-timeout has expired - upon the next cluster-recheck
(event- or interval-driven).
Debian Squeeze: Not tested again yet.
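Related to the interval-driven part: an expired failure-timeout is only
acted upon when the policy engine runs, i.e. on the next cluster event or
at the cluster-recheck-interval (15 minutes by default). For testing, that
interval can be shortened, for example:

  crm configure property cluster-recheck-interval="5min"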
