[Pacemaker] postponed execution of operations
Thilo Uttendorfer
t.uttendorfer at linux-ag.com
Thu Sep 8 18:04:03 UTC 2011
Hello,
I have sometimes failed resoucres in a heartbeat/pacemaker/drbd cluster. It seems that it happens randomly. The log
file for such a resource looks like this (I did a grep on "34401" in this example):
Sep 7 13:28:13 server-v2 lrmd: [25059]: debug: on_msg_perform_op: add an operation operation monitor[34401] on
ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] to the operation list.
Sep 7 13:28:13 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000
(call=34401, rc=0, cib-update=73376, confirmed=false) complete ok
Sep 7 13:41:56 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
[this message repeats every second...]
Sep 7 13:42:03 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:19:30 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
[this message repeats every second...]
Sep 7 14:19:44 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:19:45 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on
ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] stayed in operation list for 15010 ms (longer than 10000 ms)
Sep 7 14:20:05 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
[this message repeats every second...]
Sep 7 14:20:34 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:20:35 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on
ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] stayed in operation list for 30010 ms (longer than 10000 ms)
Sep 7 14:21:15 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:21:16 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:21:17 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
Sep 7 14:32:01 server-v2 crmd: [25062]: debug: cancel_op: Cancelling op 34401 for FS-win2k3FS
(FS-win2k3FS:34401)
Sep 7 14:32:01 server-v2 lrmd: [25059]: debug: cancel_op: operation monitor[34401] on
ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
directory=[/srv/vsb/guests/win2k3] fstype=[ext3] cancelled
Sep 7 14:32:01 server-v2 crmd: [25062]: debug: process_lrm_event: Op FS-win2k3FS_monitor_20000 (call=34401):
no delete event required
Sep 7 14:32:01 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000
(call=34401, rc=-2, cib-update=0, confirmed=true) Cancelled unknown exec error
My questions are:
- crm says "complete ok", so why is this operation postponed at all?
- Why does more than 13 minutes nothing happen between first "add an operation operation monitor" and
first "max_child_count"?
- Why are there some huge leaps between the "max_child_count"-messages (for example 13:42:03 and 14:19:30)
- Is this a bug or am I misreading the log files?
The system had not unusual high load in that time. We use pacemaker version 1.0.5+hg20090915-1, I know that's
quite ancient. If I should update to a more recent version, which version would you recommend? Can somebody
confirm that there were fixes for these problems?
Thanks for any help!
Thilo
--
Thilo Uttendorfer
Linux Information Systems AG
Putzbrunner Str. 71, 81739 München
Fon: +49 89 993412-11, Fax: +49 89 993412-99
t.uttendorfer at linux-ag.com, http://www.linux-ag.com
More information about the Pacemaker
mailing list