[Pacemaker] Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

Mario Penners mario.penners at gmail.com
Thu Aug 9 09:52:39 EDT 2012


Hi Parshvi,

just a quick-shot and without analyzing your mail in detail: find
attached an edited version of the IPaddr2 RA.

I was trying to use the original script a while agho, and basically
nothing worked: It did not recognize the link failures (due to the way
how the test was implemented it would only work if you have not more
than 1 IP per interface), there was no proper support for bonding, the
IP addresses would not be shifted ....


I did some (very minor) changes to ge the script working for us. Just
have a shot at it if you want, maybe replacing the RA will already solve
your problem.

Cheers,
Mario  


On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> Parshvi <parshvi.17 at ...> writes:
> 
> > 
> > Hi,
> > 
> > The monitor operation of IPaddr2 rsc agent is timing out.
> > Interval: 5s
> > Timeout: 60s
> > The timeout was increased from an earlier 20s to now 60s. Even then, there are 
> > multiple logs of monitor op. timing out.
> > 
> > 1) What can cause the monitor to take so long ?
> > 2) Looking at the pe-input, what contributes to the operation time ? Is it 
> just 
> > the exec-time or exec-time + queue-time ?
> > 3) Any solution proposed ?
> > 
> > I have lrm pe-input when the timeout was configured at 20s:
> > Here, is pe-input snapshot where monitor op. timed out (with timeout=20s)
> > 
> > <lrm_resource id="Group_1_ClusterIP" type="IPaddr2" class="ocf" 
> > provider="heartbeat">
> >             <lrm_rsc_op id="Group_1_ClusterIP_monitor_0" operation="monitor" 
> > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.1" transition-
> > key="28:0:7:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="0:7;28:0:7:6b445452-980a-455f-8616-7bd12f20843e" call-id="10" rc-
> code="7" 
> > op-status="0" interval="0" last-run="1343738096" last-rc-change="1343738096" 
> > exec-time="20" queue-time="30" op-digest="f22a042c86b227078b239707d4e4d4a2"/>
> > 
> >             <lrm_rsc_op id="Group_1_ClusterIP_start_0" operation="start" crm-
> > debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > key="87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="0:0;87:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83503" 
> rc-
> > code="0" op-status="0" interval="0" last-run="1343928908" last-rc-
> > change="1343928908" exec-time="280" queue-time="20" op-
> > digest="f22a042c86b227078b239707d4e4d4a2"/>
> > 
> >             <lrm_rsc_op id="Group_1_ClusterIP_monitor_5000" 
> operation="monitor" 
> > crm-debug-origin="do_update_resource" crm_feature_set="3.0.1" transition-
> > key="12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" transition-
> > magic="2:-2;12:27957:0:6b445452-980a-455f-8616-7bd12f20843e" call-id="83504" 
> rc-
> > code="-2" op-status="2" interval="5000" last-rc-change="1343928921" exec-
> > time="20000" queue-time="0" op-digest="79c3bdd01c6e0fd819484536a54bf7a2"/>
> > (Please note exec-time=20000)
> > 
> 
> Following are the details of packages:
> cluster-glue: 1.0.6 (1c87a0c58c59fc384b93ec11476cefdbb6ddc1e1)
> resource-agents: # Build version: 7a11934b142d1daf42a04fbaa0391a3ac47cee4c
> CRM Version: 1.0.12 (unknown)
> pacemaker 1.0.12-1.el5.centos - (none) x86_64
> corosync 1.2.7-1.1.el5 - (none) x86_64
> resource-agents 1.0.4-1.1.el5 - (none) x86_64
> 
> There are 4 virtual IP resources configued:
> 
> Out of these, 3 recover after a monitor timeout but one Virtual IP rsc does not 
> recover. Following are the logs that are observed:
> 
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: run_graph: Transition 63579 
> (Complete=1, Pending=0, Fired=0, Skipped=0, Incomplete=8, 
> Source=/var/lib/pengine/pe-input-1660.bz2): Terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: ERROR: te_graph_trigger: Transition 
> failed: terminated
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Graph 63579 (9 
> actions in 9 synapses): batch-limit=30 jobs, network-delay=60000ms
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 0 was 
> confirmed (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 1 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 8]: 
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 103]: 
> Pending (id: Rsc2_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 2 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 97]: 
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 8]: 
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 3 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 98]: 
> Pending (id: Rsc1_GroupClusterIP_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 4 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 99]: 
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 5 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 100]: 
> Pending (id: Rsc3_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 99]: 
> Pending (id: Rsc3_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 6 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 101]: 
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 97]: 
> Pending (id: Rsc1_GroupClusterIP_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 7 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 102]: 
> Pending (id: Rsc4_monitor_1000, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 101]: 
> Pending (id: Rsc4_start_0, loc: CSS-FU-2, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_graph: Synapse 8 is pending 
> (priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:     [Action 36]: 
> Pending (id: all_stopped, type: pseduo, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: WARN: print_elem:      * [Input 8]: 
> Pending (id: Rsc1_GroupClusterIP_stop_0, loc: CSS-FU-1, priority: 0)
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: te_graph_trigger: Transition 63579 
> is now complete
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: notify_crmd: Transition 63579 
> status: done - <null>
> Jul 29 13:41:52 CSS-FU-1 crmd: [11165]: info: do_state_transition: State 
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS 
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> 
> 1) Please help as to why a monitor is timing out ?
> 2) Why does one of the VIP's fails to recover after a timeout ?
> 
> 
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-------------- next part --------------
A non-text attachment was scrubbed...
Name: IPaddr2a
Type: application/x-shellscript
Size: 25561 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120809/21fc32f9/attachment-0003.bin>


More information about the Pacemaker mailing list