[Pacemaker] Critical: Monitor operation of IPaddr2 timing out, taking more than 60s. Fails to recover.

Thu Aug 16 01:31:36 EDT 2012

Dejan Muhamedagic <dejanmm at ...> writes:

Thanks, Dejan, for your input.

> > Here are some more findings:
> > 1) The monitor is not timing out in all environments. I have been through
> > some of the forum mails, and came across people talking about "heavy load
> > on the system" with regard to the timeout issue.
> > 2) Could somebody explain what exactly we are referring to when we say
> > "heavy load"? Also, how does it affect an operation's execution?
> 
> Heavy load, as in many processes contending for system
> resources such as CPU or disk.

The server has 8 processors (4 cores). Analyzing the I/O stats, there isn't any
contention on CPU. As for the disk, there appears to be high I/O wait at times,
but this is on the shared storage; on the local disk, the number of reads and
writes per second is very low.
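
For anyone who wants to check the I/O wait figure on their own node, here is a minimal sketch. It is not the tool I used for the numbers above. It assumes a Linux node and takes two samples of the aggregate "cpu" line from /proc/stat; the function name iowait_percent is made up for illustration.

```python
import time


def iowait_percent(before: str, after: str) -> float:
    """Percentage of CPU time spent in iowait between two samples of the
    aggregate 'cpu' line from /proc/stat (Linux only)."""

    def fields(line: str) -> list:
        parts = line.split()
        if not parts or parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        return [int(value) for value in parts[1:]]

    deltas = [new - old for old, new in zip(fields(before), fields(after))]
    # Field order: user nice system idle iowait irq softirq steal ...
    # Guest time is already counted inside user/nice, so only the first
    # eight fields are summed to get the total elapsed CPU time.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Return the first line of /proc/stat, which is the aggregate 'cpu' line."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__":
    first = read_cpu_line()
    time.sleep(5)  # sampling window in seconds, chosen arbitrarily
    second = read_cpu_line()
    print("iowait over the window: %.1f%%" % iowait_percent(first, second))
```

Note that this is a system-wide figure, so it does not tell the shared storage and the local disk apart.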
> 
> > 3) THE OPERATION MONITOR IS TIMING OUT ON OTHER RESOURCES TOO (ALONG WITH
> > IPADDR2).
> 
> That seems to indicate that indeed there's a load which your
> computer cannot sustain. BTW, why uppercase?

Since the load appears to be sustainable, could you suggest a solution?
I considered the issue (point 3) critical, hence the upper case.

> 
> > 4) None of these operations were timing out in a local environment.
> > 
> > I added some logging to the IPaddr2 resource agent script.
> > In the function ip_monitor(), I print the date on entering and on exiting
> > the function.
> > This is what I observed for:
> > interval=5s
> > timeout=60s
> > 
> > enter monitor Thu Aug 9 06:26:28 AST 2012
> > exit monitor Thu Aug 9 06:26:28 AST 2012
> > 
> > enter monitor Thu Aug 9 06:26:36 AST 2012
> > exit monitor Thu Aug 9 06:26:36 AST 2012
> > 
> > [The next monitor was invoked after 71 seconds]
> > 
> > enter monitor Thu Aug 9 06:27:47 AST 2012
> > exit monitor Thu Aug 9 06:27:47 AST 2012
> > 
> > enter monitor Thu Aug 9 06:27:52 AST 2012
> > exit monitor Thu Aug 9 06:27:52 AST 2012
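
The gaps between monitor invocations in the log above can be recomputed with a small script. This is only a sketch: the function names are made up, the log lines are assumed to be exactly in the "enter monitor <date output>" form shown above, and the timezone token is simply dropped, since all entries share the same zone.

```python
from datetime import datetime

LOG = """\
enter monitor Thu Aug 9 06:26:28 AST 2012
exit monitor Thu Aug 9 06:26:28 AST 2012
enter monitor Thu Aug 9 06:26:36 AST 2012
exit monitor Thu Aug 9 06:26:36 AST 2012
enter monitor Thu Aug 9 06:27:47 AST 2012
exit monitor Thu Aug 9 06:27:47 AST 2012
enter monitor Thu Aug 9 06:27:52 AST 2012
exit monitor Thu Aug 9 06:27:52 AST 2012
"""


def parse_stamp(line: str) -> datetime:
    """Parse the date part of an 'enter monitor ...' or 'exit monitor ...' line."""
    # Tokens: enter, monitor, weekday, month, day, clock, timezone, year
    _weekday, month, day, clock, _tz, year = line.split()[2:8]
    # %b matches English month names here; this assumes a C/English locale.
    return datetime.strptime(f"{month} {day} {clock} {year}", "%b %d %H:%M:%S %Y")


def monitor_gaps(lines) -> list:
    """Seconds between consecutive 'enter monitor' entries."""
    starts = [
        parse_stamp(line)
        for line in (raw.strip() for raw in lines)
        if line.startswith("enter monitor")
    ]
    return [int((b - a).total_seconds()) for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    for gap in monitor_gaps(LOG.splitlines()):
        print(gap)
```

With interval=5s, any gap well above 5 seconds means the monitor was started late, not that the monitor itself ran slowly, since enter and exit fall within the same second in every entry.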
> 
> There's also code preventing more than n (by default 4)
> operations running in parallel on a single node. That could be
> one explanation of larger intervals between monitors.

I suppose the parameter you're referring to is max-children of lrm.
We have set that to 34 in our case, since we were hitting the max-children
limit. The number of resources configured in Pacemaker is 19.

> 
> Thanks,
> 
> Dejan