[Pacemaker] migration-threshold causing unnecessary restart of underlying resources
Cnut Jansen
work at cnutjansen.eu
Sat Aug 14 04:26:58 UTC 2010
Hi,
and first of all thanks for answering so far.
Am 12.08.2010 18:46, schrieb Dejan Muhamedagic:
>
> The migration-threshold shouldn't in any way influence resources
> which don't depend on the resource which fails over. Couldn't
> reproduce it here with our example RAs.
Well, to clearly establish that something is wrong here - whatever it
is, a simple misconfiguration or a possible bug - I now did crm
configure erase, completely restarted both nodes, and then set up this
new, very simple, Dummy-based configuration:
v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v v
node alpha \
attributes standby="off"
node beta \
attributes standby="off"
primitive dlm ocf:heartbeat:Dummy
primitive drbd ocf:heartbeat:Dummy
primitive mount ocf:heartbeat:Dummy
primitive mysql ocf:heartbeat:Dummy \
meta migration-threshold="3" failure-timeout="40"
primitive o2cb ocf:heartbeat:Dummy
location cli-prefer-mount mount \
rule $id="cli-prefer-rule-mount" inf: #uname eq alpha
colocation colocMysql inf: mysql mount
order orderMysql inf: mount mysql
property $id="cib-bootstrap-options" \
dc-version="1.0.9-unknown" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
cluster-recheck-interval="150" \
last-lrm-refresh="1281751924"
^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
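(For reference, a sketch of how such a throwaway test setup can be driven; the file name is illustrative, the commands themselves are the standard crm shell / crm_resource calls used below:

```shell
crm configure erase                        # wipe the existing configuration
crm configure load update dummy-test.crm   # load the Dummy-based test setup from a file
crm_resource -F -r mysql -H alpha          # inject a fake failure into mysql on alpha
```

These need a running cluster, so they are shown only as a fragment.)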
...and then, by repeatedly picking on the resource "mysql", got this:
1) alpha: FC(mysql)=0, crm_resource -F -r mysql -H alpha
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_asyncmon_0 (call=48, rc=1, cib-update=563,
confirmed=false) unknown error
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=49, rc=0, cib-update=565, confirmed=true) ok
Aug 14 04:15:30 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_start_0 (call=50, rc=0, cib-update=567, confirmed=true) ok
2) alpha: FC(mysql)=1, crm_resource -F -r mysql -H alpha
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_asyncmon_0 (call=51, rc=1, cib-update=568,
confirmed=false) unknown error
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=52, rc=0, cib-update=572, confirmed=true) ok
Aug 14 04:15:42 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_start_0 (call=53, rc=0, cib-update=573, confirmed=true) ok
3) alpha: FC(mysql)=2, crm_resource -F -r mysql -H alpha
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_asyncmon_0 (call=54, rc=1, cib-update=574,
confirmed=false) unknown error
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=55, rc=0, cib-update=576, confirmed=true) ok
Aug 14 04:15:56 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_stop_0 (call=56, rc=0, cib-update=578, confirmed=true) ok
beta: FC(mysql)=3
Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation
mount_start_0 (call=36, rc=0, cib-update=92, confirmed=true) ok
Aug 14 04:15:56 beta crmd: [868]: info: process_lrm_event: LRM operation
mysql_start_0 (call=37, rc=0, cib-update=93, confirmed=true) ok
Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation
mysql_stop_0 (call=38, rc=0, cib-update=94, confirmed=true) ok
Aug 14 04:18:26 beta crmd: [868]: info: process_lrm_event: LRM operation
mount_stop_0 (call=39, rc=0, cib-update=95, confirmed=true) ok
alpha: FC(mysql)=3
Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_start_0 (call=57, rc=0, cib-update=580, confirmed=true) ok
Aug 14 04:18:26 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_start_0 (call=58, rc=0, cib-update=581, confirmed=true) ok
So it seems that - for whatever reason - those constrained resources are
considered and treated just as if they were in a resource group, because
they move to wherever they can all run, instead of the "eat or die" I had
expected for the dependent resource (mysql) towards the underlying
resource (mount) with the constraints as I set them... shouldn't I?! o_O
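(If the intention really is "mysql stops rather than dragging mount along", one way - untested sketch, constraint name is made up - might be to pin mount to alpha with a -inf rule instead of the advisory cli-prefer constraint, so that when mysql reaches its migration-threshold on alpha there is simply nowhere for the pair to move:

```shell
crm configure location lockMount mount \
    rule -inf: '#uname' ne alpha
```

With that in place mysql should just stay stopped after the third failure instead of pulling mount over to beta.)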
And - concerning the failure-timeout - quite a while later, without
having reset mysql's failure counter or having done anything else in
the meantime:
4) alpha: FC(mysql)=3, crm_resource -F -r mysql -H alpha
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_asyncmon_0 (call=59, rc=1, cib-update=592,
confirmed=false) unknown error
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_stop_0 (call=60, rc=0, cib-update=596, confirmed=true) ok
Aug 14 04:44:47 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_stop_0 (call=61, rc=0, cib-update=597, confirmed=true) ok
beta: FC(mysql)=0
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation
mount_start_0 (call=40, rc=0, cib-update=96, confirmed=true) ok
Aug 14 04:44:47 beta crmd: [868]: info: process_lrm_event: LRM operation
mysql_start_0 (call=41, rc=0, cib-update=97, confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation
mysql_stop_0 (call=42, rc=0, cib-update=98, confirmed=true) ok
Aug 14 04:47:17 beta crmd: [868]: info: process_lrm_event: LRM operation
mount_stop_0 (call=43, rc=0, cib-update=99, confirmed=true) ok
alpha: FC(mysql)=4
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM
operation mount_start_0 (call=62, rc=0, cib-update=599, confirmed=true) ok
Aug 14 04:47:17 alpha crmd: [900]: info: process_lrm_event: LRM
operation mysql_start_0 (call=63, rc=0, cib-update=600, confirmed=true) ok
> BTW, what's the point of cloneMountMysql? If it can run only
> where drbd is master, then it can run on one node only:
>
> colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
> order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
It's a dual-primary DRBD configuration, so when everything is OK (-;
there are actually two masters of each DRBD multi-state resource...
even though I admit that the dual primary (i.e. master) for msDrbdMysql
is currently quite redundant, since in the current cluster configuration
there's only one, primitive MySQL resource, and thus no strict need for
MySQL's data dir to be mounted on both nodes at all times.
But since it's not harmful to have it mounted on the other node too,
since msDrbdOpencms and msDrbdShared do need to be mounted on both
nodes, and since I put the complete installation and configuration of
the cluster into flexibly configurable shell scripts, it's easier - or
at least less typing - to just put the configuration of all DRBD and
mount resources into one common loop. (-;
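(Roughly like this - an illustrative sketch of the "one common loop" approach; the resource names, devices and mount points are assumptions, not my actual script:

```shell
#!/bin/sh
# Build one crm configure snippet covering all DRBD + mount resource
# pairs, so adding a resource means adding one word to the list.
snippet=""
for res in mysql opencms shared; do
    snippet="${snippet}primitive drbd-${res} ocf:linbit:drbd params drbd_resource=${res}
primitive mount-${res} ocf:heartbeat:Filesystem params device=/dev/drbd/by-res/${res} directory=/srv/${res} fstype=ocfs2
"
done
# Print the generated snippet; in the real script this would be piped
# into "crm configure load update -".
printf '%s' "$snippet"
```

)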
>> d) I also have the impression that fail-counters don't get reset
>> after their failure-timeout, because when migration-threshold=3 is
>> set, those issues occur upon every(!) subsequent picking-on, even
>> when I've waited for nearly 5 minutes (with failure-timeout=90)
>> without touching the cluster at all
> That seems to be a bug though I couldn't reproduce it with a
> simple configuration.
I just tested this once again: it seems that failure-timeout only
resets scores from -inf back to around 0 (wherever they would normally
be), allowing the resources to return to the node. I tested this by
setting a location constraint for the underlying resource (see
configuration): after the failure-timeout has expired, on the next
cluster-recheck (and only then!) the underlying resource and its
dependents return to the underlying resource's preferred location, as
you can see in the logs above.
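(Which also means the two timers interact: if expired failures are only acted upon at the next cluster-recheck, the effective delay is up to failure-timeout plus cluster-recheck-interval. A sketch - values are examples only - of keeping the recheck interval at or below the timeout so the return happens reasonably promptly:

```shell
crm configure property cluster-recheck-interval="60s"
crm configure primitive mysql ocf:heartbeat:mysql \
    meta migration-threshold="3" failure-timeout="120s"
```

)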