[Pacemaker] migration-threshold causing unnecessary restart of underlying resources
Cnut Jansen
work at cnutjansen.eu
Thu Aug 12 02:12:02 UTC 2010
Hi,
I'm once again seeing what looks (imho) like strange behaviour, or rather
strange decision-making, by Pacemaker, and I hope that someone can either
enlighten me a little about it and its intention, point out a possible
misconfiguration or something, or confirm it as a possible bug.
Basically I have a cluster of 2 nodes with cloned DLM-, O2CB-, DRBD-,
mount-resources, and a MySQL-resource (grouped with an IPaddr-resource)
running on top of the other ones.
The MySQL(-group)-resource depends on the mount-resource, which depends
on both, the DRBD- and the O2CB-resources equally, and the O2CB-resource
depends on the DLM-resource.
cloneDlm -> cloneO2cb -\
                        }-> cloneMountMysql -> mysql / grpMysql(mysql -> ipMysql)
msDrbdMysql -----------/
Furthermore, for the MySQL(-group)-resource I set the meta-attributes
"migration-threshold=1" and "failure-timeout=90" (later I also tried
"3" and "130" for these).
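For reference, the relevant bits sketched in crm shell syntax (from memory,
not a verbatim dump - the real configurations are in the attached files, and
the constraint names here are made up):

```
# sketch only - see the attached config files for the real thing
group grpMysql mysql ipMysql \
        meta migration-threshold="1" failure-timeout="90"
order o-dlm-before-o2cb inf: cloneDlm cloneO2cb
order o-o2cb-before-mount inf: cloneO2cb cloneMountMysql
order o-drbd-before-mount inf: msDrbdMysql:promote cloneMountMysql:start
order o-mount-before-mysql inf: cloneMountMysql grpMysql
colocation c-mount-with-drbd-master inf: cloneMountMysql msDrbdMysql:Master
```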
Now I picked a little on mysql by injecting failures with "crm_resource -F
-r mysql -H <node>", expecting that only mysql, respectively its group
(I tested both configurations; same result), would be stopped (and moved
over to the other node).
But actually not only mysql/grpMysql was stopped: the mount- and even the
DRBD-resources were stopped as well. Upon restarting them, the
DRBD-resource was left as slave (so of course the mount wasn't allowed to
start again either) and - back then, before I set
cluster-recheck-interval=2m - didn't even seem to try to promote back to
master (I didn't wait out cluster-recheck-interval's default of 15m).
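(In case it matters for reproducing: the recheck interval mentioned above is
just the cluster property, i.e. something like)

```
crm configure property cluster-recheck-interval="2m"
```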
Now through a lot of testing I found out that:
a) the stops/restarts of the underlying resources happen only when the
fail-counter hits the limit set by migration-threshold; i.e. when it is set
to 3, on the first 2 failures only mysql/grpMysql is restarted on the same
node, and only on the 3rd one are the underlying resources left in a mess
(while mysql/grpMysql migrates). (For DRBD this is reproducible; I'm unsure
about the DLM/O2CB side, but there's sometimes serious trouble there too
after having picked on mysql; I just couldn't definitively link it yet.)
b) upon causing mysql/grpMysql's migration, the score for
msDrbdMysql:promote changes from 10020 to -inf and stays there for the
duration of mysql/grpMysql's failure-timeout (verified by also setting it
to 130), before it rises back up to 10000
c) msDrbdMysql remains slave until the next cluster-recheck after its
promote-score went back up to 10000
d) I also have the impression that fail-counters don't get reset after
their failure-timeout, because with migration-threshold=3 set, those issues
occur upon every(!) following picking-on, even when I've waited for nearly
5 minutes (with failure-timeout=90) without touching the cluster at all
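For checking point d), I query and clear the fail-count by hand roughly like
this (option spellings may differ slightly between 1.0.x and 1.1.x, so treat
it as a sketch):

```
# query mysql's current fail-count on a node
crm_failcount -G -r mysql -U nde35

# clear the fail-count / failure state manually
crm_resource -C -r mysql -H nde35
```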
I experienced this on both test clusters, a SLES 11 HAE SP1 with
Pacemaker 1.1.2 and a Debian Squeeze with Pacemaker 1.0.9. When
migration-threshold for mysql/grpMysql is removed, everything is fine
(except that there's no migration, of course). I can't remember anything
like this happening with SLES 11 HAE SP0's Pacemaker 1.0.6.
I'd really appreciate any comments and/or enlightenment about what's the
deal with this. (-;
p.s.: Just for fun / testing / proof, I also constrained grpLdirector to
cloneMountShared... and could perfectly reproduce the problem with its
then underlying resources too.
================================================================================
2) mysql: meta migration-threshold=1 failure-timeout=130 ->
drbd:promote only becomes possible again, score-wise, after 130 seconds
nde34:~ # nd=nde35; cl=1; failcmd="crm_resource -F -r mysql -H $nd" ; \
    date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; \
    date ; echo $failcmd ; $failcmd ; \
    date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; \
    sleep 85 ; \
    while [ true ]; do date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; sleep 5 ; done
Wed Aug 11 15:33:04 CEST 2010
drbdMysql:1 promotion score on nde35: 10020
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
Wed Aug 11 15:33:04 CEST 2010
crm_resource -F -r mysql -H nde35
Wed Aug 11 15:33:05 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:34:31 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
[...]
Wed Aug 11 15:35:11 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:35:16 CEST 2010
drbdMysql:1 promotion score on nde35: 10000
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
^C
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cluster-conf - sles11sp1.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100812/023d2b81/attachment-0006.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cluster-conf - squeeze.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100812/023d2b81/attachment-0007.txt>