[Pacemaker] bug in monitor timeout?

Thu Oct 4 12:18:33 CEST 2012

> Hi,
> 
> On Wed, Oct 03, 2012 at 10:07:06PM +0000, James Harper wrote:
> > It seems like every time I modify a resource, things start timing out. Just
> > now I changed the location of where a ping resource could run and this
> > happened:
> > Oct  4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the
> > operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in
> > operation list for 22000 ms (longer than 10000 ms)
> 
> That's interesting. Normally such a change should result in just a few
> operations. Did you take a look at the transition which resulted from this
> change?

I don't think I know how to do that. All I changed, though, was a ping resource that wasn't in use yet, so I can't see how much could have changed.

> > Another oddity is that the resource for p_lvm_iscsi is defined as:
> >
> > primitive p_lvm_iscsi ocf:heartbeat:LVM \
> >         params volgrpname="vg-drbd" \
> >         op start interval="0" timeout="30s" \
> >         op stop interval="0" timeout="30s" \
> >         op monitor interval="10s" timeout="30s"
> >
> > so I don't know where the timeout of 10000ms is coming from.
> >
> > When I change something with crm configure the cib process shoots up to
> > 100% CPU and stays there for a while, and the node becomes more-or-less
> > unresponsive, which may go some way to explaining why things time out. Is
> > this normal? It doesn't explain why lrmd complains that something took
> > longer than 10s when I set the timeout to 30s though, unless the interval
> > somehow interacts with that?
> 
> Ten seconds is an ad-hoc time and has nothing to do with specific timeouts.
> lrmd logs a warning if an operation stays in the queue for longer than that.

Ah. I jumped to the wrong conclusion there, then.
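
Since the 10000 ms figure is a queue-time threshold rather than the operation timeout, it may help to pull the queue times out of the logs and see which operations waited longest. A minimal sketch, assuming the warning always has the shape of the one line quoted above (the regex and the function name are my own guesses, not anything lrmd documents):

```python
import re

# Pattern for the lrmd warning quoted above. This is an assumption based
# on that single log line, not a documented lrmd format.
QUEUE_WARN = re.compile(
    r"operation (?P<op>\w+)\[(?P<id>\d+)\] on (?P<rsc>\S+) for client \d+ "
    r"stayed in operation list for (?P<waited>\d+) ms "
    r"\(longer than (?P<limit>\d+) ms\)"
)

def parse_queue_warning(line):
    """Return (resource, operation, waited_ms, limit_ms), or None if no match."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(line.split())
    m = QUEUE_WARN.search(flat)
    if m is None:
        return None
    return m["rsc"], m["op"], int(m["waited"]), int(m["limit"])

sample = """Oct  4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the
operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in
operation list for 22000 ms (longer than 10000 ms)"""

print(parse_queue_warning(sample))
```

The second number in the tuple is how long the operation sat in the queue; the third is the fixed warning threshold, which is independent of the 30s timeout on the primitive.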

> How many resources do you have? You can also increase max-children (an
> lrmd parameter), which is the number of operations that lrmd is allowed to
> run concurrently (lrmadmin -p max-children n; by default it's set to 4).

crm status says:

Last updated: Thu Oct  4 19:42:30 2012
Last change: Thu Oct  4 08:21:47 2012
Stack: openais
Current DC: <xxx> - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
5 Nodes configured, 5 expected votes
81 Resources configured.

I'm not sure how many counts as a lot, but I wouldn't have thought that 81 rates very highly.
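
A rough way to see why max-children could matter even at this size is to model lrmd as a fixed pool of slots. This is only a sketch: the function name, the batch size of 20 and the 2 seconds per operation are made-up numbers for illustration, and I'm assuming every operation is queued at once and takes the same time, which is not necessarily how lrmd schedules work.

```python
import heapq

def queue_waits(n_ops, max_children, op_seconds):
    """Queue wait (seconds) of each op when all n_ops are submitted at t=0.

    Toy model only: a fixed pool of max_children slots, every operation
    taking op_seconds. Not a description of lrmd's real scheduler.
    """
    slots = [0.0] * max_children  # time at which each slot becomes free
    heapq.heapify(slots)
    waits = []
    for _ in range(n_ops):
        start = heapq.heappop(slots)   # earliest free slot
        waits.append(start)            # submitted at t=0, so wait == start
        heapq.heappush(slots, start + op_seconds)
    return waits

# Hypothetical numbers: 20 operations queued on one node, 2 s each.
for children in (4, 8):
    print(children, max(queue_waits(20, children, 2.0)))
```

In this toy model, doubling the slots from 4 to 8 halves the longest wait, so a burst of operations after a configuration change could plausibly push some of them past the 10 second warning threshold with the default setting.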

> > Versions of software are all from Debian Wheezy:
> > corosync 1.4.2-3
> > pacemaker 1.1.7-1
> 
> I'd suggest opening a bugzilla and attaching an hb_report (or crm_report,
> whichever your distribution ships).
> 

I plan to do some more testing over the weekend to pin down exactly what is going on; otherwise any bug report is going to be a bit vague.

Thanks

James