[Pacemaker] Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.
Mario Penners
mario.penners at gmail.com
Thu Aug 9 13:52:39 UTC 2012
Hi Parshvi,
just a quick shot, without having analyzed your mail in detail: find
attached an edited version of the IPaddr2 RA.
I tried to use the original script a while ago, and basically
nothing worked: it did not recognize link failures (because of the way
the test was implemented, it only worked with no more than one IP per
interface), there was no proper support for bonding, the
IP addresses would not be shifted ....
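To illustrate the more-than-one-IP-per-interface case: a check has to look at every address on the interface, not just the first one. Below is a rough Python sketch of that idea. It is not the code from the RA (which is a shell script); the helper names and the sample addresses are made up for illustration, and it assumes the one-line output format of `ip -o -f inet addr show dev <nic>`.

```python
import subprocess

# Made-up sample of `ip -o -f inet addr show dev eth0` output with two
# addresses on the same interface (addresses are purely illustrative).
SAMPLE_OUTPUT = """\
2: eth0    inet 192.168.1.10/24 brd 192.168.1.255 scope global eth0
2: eth0    inet 192.168.1.50/24 brd 192.168.1.255 scope global secondary eth0
"""


def read_ip_output(nic):
    """Run `ip` for one interface and return its one-line-per-address output."""
    result = subprocess.run(
        ["ip", "-o", "-f", "inet", "addr", "show", "dev", nic],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def addresses_on(ip_output):
    """Return every IPv4 address found in the output, without the prefix length."""
    addrs = []
    for line in ip_output.splitlines():
        fields = line.split()
        if "inet" in fields:
            cidr = fields[fields.index("inet") + 1]
            addrs.append(cidr.split("/")[0])
    return addrs


def is_configured(ip, ip_output):
    """True if `ip` is among the addresses, however many the interface has."""
    return ip in addresses_on(ip_output)


if __name__ == "__main__":
    # Uses the sample text; swap in read_ip_output("eth0") on a real node.
    print(addresses_on(SAMPLE_OUTPUT))
    print(is_configured("192.168.1.50", SAMPLE_OUTPUT))
```

Comparing whole addresses rather than grepping for a substring also avoids matching 192.168.1.5 against 192.168.1.50.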
I made some (very minor) changes to get the script working for us. Give
it a try if you want; maybe replacing the RA will already solve
your problem.
Cheers,
Mario
On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> Parshvi <parshvi.17 at ...> writes:
>
> >
> > Hi,
> >
> > The monitor operation of IPaddr2 rsc agent is timing out.
> > Interval: 5s
> > Timeout: 60s
> > The timeout was increased from an earlier 20s to now 60s. Even then, there are
> > multiple logs of monitor op. timing out.
> >
> > 1) What can cause the monitor to take so long ?
> > 2) Looking at the pe-input, what contributes to the operation time ? Is it just
> > the exec-time or exec-time + queue-time ?
> > 3) Any solution proposed ?
> >
> > I have the lrm section of the pe-input from when the timeout was configured at 20s.
> > Here is the pe-input snapshot where the monitor op. timed out (with timeout=20s):
> >
> > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" provider="heartbeat">
> >   <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
> >     crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1"
> >     transition-key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="10" rc-code="7" op-status="0" interval="0"
> >     last-run="1343738096" last-rc-change="1343738096"
> >     exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> >   <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
> >     transition-key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="83503" rc-code="0" op-status="0" interval="0"
> >     last-run="1343928908" last-rc-change="1343928908"
> >     exec-time="280" queue-time="20" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> >
> >   <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" operation="monitor"
> >     crm-debug-origin="do_update_resource" crm_feature_set="3.0.1"
> >     transition-key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     transition-magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e"
> >     call-id="83504" rc-code="-2" op-status="2" interval="5000"
> >     last-rc-change="1343928921" exec-time="20000" queue-time="0"
> >     op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> > (Please note exec-time=20000)
> >
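On question 2), I can't say for sure which of the two counts against the timeout, but it is easy to put the numbers from the snippet above side by side. Here is a small Python sketch that does so. Assumptions on my side: the attributes are trimmed to the ones that matter here, a closing </lrm_resource> tag is added so the fragment parses, and the function names are made up. As far as I know, op-status="2" is how a timed-out operation is recorded.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the quoted pe-input fragment. The closing </lrm_resource>
# tag is added here only so that the fragment is well-formed XML.
PE_INPUT_SNIPPET = """
<lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" provider="heartbeat">
  <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor"
      rc-code="7" op-status="0" interval="0" exec-time="20" queue-time="30"/>
  <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start"
      rc-code="0" op-status="0" interval="0" exec-time="280" queue-time="20"/>
  <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" operation="monitor"
      rc-code="-2" op-status="2" interval="5000" exec-time="20000" queue-time="0"/>
</lrm_resource>
"""


def summarize_ops(xml_text):
    """Return one dict per lrm_rsc_op with its timings in milliseconds."""
    root = ET.fromstring(xml_text)
    rows = []
    for op in root.iter("lrm_rsc_op"):
        exec_ms = int(op.get("exec-time", "0"))
        queue_ms = int(op.get("queue-time", "0"))
        rows.append({
            "id": op.get("id"),
            "op_status": int(op.get("op-status", "0")),
            "exec_ms": exec_ms,
            "queue_ms": queue_ms,
            "total_ms": exec_ms + queue_ms,
        })
    return rows


def ops_at_or_over(rows, timeout_ms):
    """Ids of operations whose exec-time alone reached the given timeout."""
    return [row["id"] for row in rows if row["exec_ms"] >= timeout_ms]


if __name__ == "__main__":
    rows = summarize_ops(PE_INPUT_SNIPPET)
    for row in rows:
        print(row)
    print("at or over 20s:", ops_at_or_over(rows, 20000))
```

In the quoted data the failed monitor has queue-time=0 and exec-time equal to the configured 20s, so in this case the agent itself ran for the full timeout; the time was not spent waiting in the queue.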
>
> Following are the details of packages:
> cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> CRM Version: 1.0.12 (unknown)
> pacemaker 1.0.12-1.el5.centos - (none) x86_64
> corosync 1.2.7-1.1.el5 - (none) x86_64
> resource-agents 1.0.4-1.1.el5 - (none) x86_64
>
> There are 4 virtual IP resources configured:
>
> Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not
> recover. Following are the logs that are observed:
>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579
> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8,
> Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition
> failed: terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9
> actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was
> confirmed (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 103]:
> Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 98]:
> Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 99]:
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 100]:
> Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 99]:
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 101]:
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 97]:
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 102]:
> Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 101]:
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: [Action 36]:
> Pending (id: all_stopped, type: pseduo, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem: * [Input 8]:
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579
> is now complete
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579
> status: done - <null>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> 1) Please help: why is the monitor timing out ?
> 2) Why does one of the VIPs fail to recover after a timeout ?
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: IPaddr2a
Type: application/x-shellscript
Size: 25561 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120809/21fc32f9/attachment-0004.bin>