[Pacemaker] postponed execution of operations
Dejan Muhamedagic
dejanmm at fastmail.fm
Fri Sep 16 15:25:07 UTC 2011
Hi,
On Thu, Sep 08, 2011 at 08:04:03PM +0200, Thilo Uttendorfer wrote:
> Hello,
>
> I have sometimes failed resoucres in a heartbeat/pacemaker/drbd cluster. It seems that it happens randomly. The log
> file for such a resource looks like this (I did a grep on "34401" in this example):
>
> Sep 7 13:28:13 server-v2 lrmd: [25059]: debug: on_msg_perform_op: add an operation operation monitor[34401] on
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] to the operation list.
> Sep 7 13:28:13 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000
> (call=34401, rc=0, cib-update=73376, confirmed=false) complete ok
> Sep 7 13:41:56 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> [this message repeats every second...]
> Sep 7 13:42:03 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:19:30 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> [this message repeats every second...]
> Sep 7 14:19:44 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:19:45 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] stayed in operation list for 15010 ms (longer than 10000 ms)
> Sep 7 14:20:05 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> [this message repeats every second...]
> Sep 7 14:20:34 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:20:35 server-v2 lrmd: [25059]: WARN: perform_ra_op: the operation operation monitor[34401] on
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] stayed in operation list for 30010 ms (longer than 10000 ms)
> Sep 7 14:21:15 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:21:16 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:21:17 server-v2 lrmd: [25059]: notice: max_child_count (4) reached, postponing execution of operation
> monitor[34401] on ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor]
> crm_feature_set=[3.0.1] device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] by 1000 ms
> Sep 7 14:32:01 server-v2 crmd: [25062]: debug: cancel_op: Cancelling op 34401 for FS-win2k3FS
> (FS-win2k3FS:34401)
> Sep 7 14:32:01 server-v2 lrmd: [25059]: debug: cancel_op: operation monitor[34401] on
> ocf::Filesystem::FS-win2k3FS for client 25062, its parameters: CRM_meta_name=[monitor] crm_feature_set=[3.0.1]
> device=[/dev/drbd/by-res/win2k3] CRM_meta_timeout=[40000] CRM_meta_interval=[20000]
> directory=[/srv/vsb/guests/win2k3] fstype=[ext3] cancelled
> Sep 7 14:32:01 server-v2 crmd: [25062]: debug: process_lrm_event: Op FS-win2k3FS_monitor_20000 (call=34401):
> no delete event required
> Sep 7 14:32:01 server-v2 crmd: [25062]: info: process_lrm_event: LRM operation FS-win2k3FS_monitor_20000
> (call=34401, rc=-2, cib-update=0, confirmed=true) Cancelled unknown exec error
>
> My questions are:
> - crm says "complete ok", so why is this operation postponed at all?
complete ok is somewhat misleading, I think. I think that
confirmed=false means that there was no callback yet for the
operation.
> - Why does more than 13 minutes nothing happen between first "add an operation operation monitor" and
> first "max_child_count"?
The repeating monitor operation is silent, unless there's
something important to report. Within those 13 minutes there
should've been n monitor operations. If not, then it's really
strange.
> - Why are there some huge leaps between the "max_child_count"-messages (for example 13:42:03 and 14:19:30)
> - Is this a bug or am I misreading the log files?
It's sort of dynamic :) If there are more than max_child_count
resource operations running at the time another operation is
supposed to be scheduled, you'll see this message.
> The system had not unusual high load in that time. We use pacemaker version 1.0.5+hg20090915-1, I know that's
> quite ancient. If I should update to a more recent version, which version would you recommend? Can somebody
> confirm that there were fixes for these problems?
Apart from upgrading, you can increase the LRM max children.
That's a fairly new addition, not sure if your installation
supports that. Try:
# lrmadmin -p max-children 8
I'm also not sure if your init script supports setting it in
/etc/default files. That's also a relatively recent addition.
Thanks,
Dejan
> Thanks for any help!
>
> Thilo
>
>
> --
> Thilo Uttendorfer
> Linux Information Systems AG
> Putzbrunner Str. 71, 81739 München
>
> Fon: +49 89 993412-11, Fax: +49 89 993412-99
> t.uttendorfer at linux-ag.com, http://www.linux-ag.com
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
More information about the Pacemaker
mailing list