[Pacemaker] Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

Mon Aug 13 16:03:22 UTC 2012

Hi,

On Fri, Aug 10, 2012 at 05:44:47AM +0000, Parshvi wrote:
> Mario Penners <mario.penners at ...> writes:
> 
> > 
> > Hi Parshvi,
> > 
> > just a quick-shot and without analyzing your mail in detail: find
> > attached an edited version of the IPaddr2 RA.
> > 
> > I was trying to use the original script a while ago, and basically
> > nothing worked: It did not recognize the link failures (due to the way
> > how the test was implemented it would only work if you have not more
> > than 1 IP per interface), there was no proper support for bonding, the
> > IP addresses would not be shifted ....
> > 
> > I did some (very minor) changes to get the script working for us. Just
> > have a shot at it if you want, maybe replacing the RA will already solve
> > your problem.
> > 
> > Cheers,
> > Mario  
> > 
> > On Thu, 2012-08-09 at 05:44 +0000, Parshvi wrote:
> > > Parshvi <parshvi.17 at ...> writes:
> > > 
> > > > 
> > > > Hi,
> > > > 
> > > > The monitor operation of IPaddr2 rsc agent is timing out.
> > > > Interval: 5s
> > > > Timeout: 60s
> > > > The timeout was increased from an earlier 20s to now 60s. Even then, there are
> > > > multiple logs of monitor op. timing out.
> > > > 
> > > > 1) What can cause the monitor to take so long ?
> > > > 2) Looking at the pe-input, what contributes to the operation time ? Is it just
> > > > the exec-time or exec-time + queue-time ?
> > > > 3) Any solution proposed ?
> > > > 
> > 
> 
> Thanks Mario for your input.
> 
> There are some more findings:
> 1) The monitor is not timing out in all environments. I have been through some 
> of the forum mails, and came across people talking about "heavy load on the 
> system" wrt the timeout issue.
> 2) Could somebody explain, what exactly are we referring to when we say "heavy 
> load" ? Also, how does it affect an operation's execution ?

Heavy load, as in many processes contending for system
resources such as CPU or disk.
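
One rough way to put a number on "heavy load" is to compare the load
average with the number of CPUs. The sketch below is only an
illustration: the function names and the threshold of 1.0 per CPU are
assumptions made here, not anything Pacemaker itself uses.

```python
import os

def load_per_cpu(load_avg, cpu_count):
    """Return the load average normalised by the number of CPUs."""
    return load_avg / max(cpu_count, 1)

def looks_overloaded(load_avg, cpu_count, threshold=1.0):
    """True if the load per CPU exceeds the (assumed) threshold."""
    return load_per_cpu(load_avg, cpu_count) > threshold

if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    # On Linux the figure also counts tasks in uninterruptible (typically
    # disk) wait, so it reflects both CPU and I/O contention.
    one_min, five_min, fifteen_min = os.getloadavg()
    cpus = os.cpu_count() or 1
    print("1-minute load per CPU: %.2f" % load_per_cpu(one_min, cpus))
    print("looks overloaded:", looks_overloaded(one_min, cpus))
```

Running this on the affected node while the monitors are timing out,
and again on a node where they are not, should show whether load is a
plausible explanation.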

> 3) THE OPERATION MONITOR IS TIMING OUT ON OTHER RESOURCES TOO( ALONG WITH 
> IPADDR2).

That seems to indicate that indeed there's a load which your
computer cannot sustain. BTW, why uppercase?

> 4) None of these operations were timing out in a local environment.
> 
> I added some logging in IPaddr2 resource agent script.
> In func. ip_monitor(), I have printed the date on entering and on exiting the
> monitor func.
> This is what I observed for:
> interval=5s
> timeout=60s
> 
> enter monitor Thu Aug 9 06:26:28 AST 2012
> exit monitor Thu Aug 9 06:26:28 AST 2012
> 
> enter monitor Thu Aug 9 06:26:36 AST 2012
> exit monitor Thu Aug 9 06:26:36 AST 2012
> 
> [The next monitor was invoked after 71 seconds]
> 
> enter monitor Thu Aug 9 06:27:47 AST 2012
> exit monitor Thu Aug 9 06:27:47 AST 2012
> 
> enter monitor Thu Aug 9 06:27:52 AST 2012
> exit monitor Thu Aug 9 06:27:52 AST 2012
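
The timestamps above can be checked mechanically. The sketch below
parses the quoted log lines and computes how long each monitor ran and
how far apart consecutive monitors started. The helper names are made
up for this example, and the time zone token is simply dropped on the
assumption that all samples share one zone.

```python
from datetime import datetime

LOG = """\
enter monitor Thu Aug 9 06:26:28 AST 2012
exit monitor Thu Aug 9 06:26:28 AST 2012
enter monitor Thu Aug 9 06:26:36 AST 2012
exit monitor Thu Aug 9 06:26:36 AST 2012
enter monitor Thu Aug 9 06:27:47 AST 2012
exit monitor Thu Aug 9 06:27:47 AST 2012
enter monitor Thu Aug 9 06:27:52 AST 2012
exit monitor Thu Aug 9 06:27:52 AST 2012
"""

def parse_line(line):
    """Split 'enter monitor <date output>' into (event, datetime)."""
    event, _monitor, *stamp = line.split()
    # stamp is e.g. ['Thu', 'Aug', '9', '06:26:28', 'AST', '2012'];
    # drop the time zone name, which strptime does not parse portably.
    del stamp[4]
    return event, datetime.strptime(" ".join(stamp), "%a %b %d %H:%M:%S %Y")

def summarize(log):
    """Return (run time of each monitor, gap between consecutive starts)."""
    events = [parse_line(l) for l in log.splitlines() if l.strip()]
    enters = [t for e, t in events if e == "enter"]
    exits = [t for e, t in events if e == "exit"]
    durations = [(x - n).total_seconds() for n, x in zip(enters, exits)]
    gaps = [(b - a).total_seconds() for a, b in zip(enters, enters[1:])]
    return durations, gaps

durations, gaps = summarize(LOG)
print("run times (s):", durations)
print("gaps between starts (s):", gaps)
```

Every run finishes within the same second, while one gap between
starts is 71 seconds. So in these samples the time is not being spent
inside ip_monitor() itself; the operation is apparently waiting before
the agent is even started.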

There's also code preventing more than n (by default 4)
operations running in parallel on a single node. That could be
one explanation of larger intervals between monitors.
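
To illustrate how such a limit can delay a monitor that is itself
fast, here is a toy model of a first-come, first-served queue with a
fixed number of slots. It is not the actual scheduling code, only a
sketch under simplified assumptions; the job list is invented, and the
default of 4 is the figure mentioned above.

```python
import heapq

def schedule(jobs, max_parallel=4):
    """Simulate a FIFO queue with at most max_parallel jobs running.

    jobs is a list of (arrival_time, duration) tuples. Returns a list of
    (start_time, queue_wait) tuples in order of arrival.
    """
    running = []  # min-heap holding the finish time of each running job
    result = []
    for arrival, duration in sorted(jobs):
        # Free every slot whose job has finished by the time this one arrives.
        while running and running[0] <= arrival:
            heapq.heappop(running)
        if len(running) < max_parallel:
            start = arrival  # a slot is free, start immediately
        else:
            start = heapq.heappop(running)  # wait for the earliest finish
        heapq.heappush(running, start + duration)
        result.append((start, start - arrival))
    return result

# Hypothetical scenario: four slow operations (60s each) occupy all slots
# at t=0, then a 1s monitor is requested at t=5.
jobs = [(0, 60), (0, 60), (0, 60), (0, 60), (5, 1)]
start, wait = schedule(jobs)[-1]
print("monitor starts at t=%ds after waiting %ds in the queue" % (start, wait))
```

In this model the monitor needs only one second to run but sits in the
queue for 55 seconds, which is the kind of gap seen in the log if the
other slots are taken by slow operations.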

Thanks,

Dejan

> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org