[Pacemaker] bug in monitor timeout?

Thu Oct 4 14:54:43 UTC 2012

On 10/04/2012 12:18 PM, James Harper wrote:
>> Hi,
>>
>> On Wed, Oct 03, 2012 at 10:07:06PM +0000, James Harper wrote:
>>> It seems like every time I modify a resource, things start timing out. Just
>>> now I changed the location of where a ping resource could run and this
>>> happened:
>>> Oct  4 07:07:07 bitvs5 lrmd: [3681]: WARN: perform_ra_op: the
>>> operation monitor[52] on p_lvm_iscsi:0 for client 3686 stayed in
>>> operation list for 22000 ms (longer than 10000 ms)
>>
>> That's interesting. Normally such a change should result in just a few
>> operations. Did you take a look at the transition which resulted from this
>> change?
> 
> I don't think I know how to do that. All I changed though was a ping resource that wasn't (yet) in use so I can't see that too much could have changed.
> 
>>> Another oddity is that the resource for p_lvm_iscsi is defined as:
>>>
>>> primitive p_lvm_iscsi ocf:heartbeat:LVM \
>>>         params volgrpname="vg-drbd" \
>>>         op start interval="0" timeout="30s" \
>>>         op stop interval="0" timeout="30s" \
>>>         op monitor interval="10s" timeout="30s"
>>>
>>> so I don't know where the timeout of 10000ms is coming from.
>>>
>>> When I change something with crm configure the cib process shoots up to
>>> 100% CPU and stays there for a while, and the node becomes more-or-less
>>> unresponsive, which may go some way to explaining why things time out. Is
>>> this normal? It doesn't explain why lrmd complains that something took
>>> longer than 10s when I set the timeout to 30s though, unless the interval
>>> somehow interacts with that?
>>
>> Ten seconds is an ad-hoc time and has nothing to do with specific timeouts.
>> lrmd logs a warning if an operation stays in the queue for longer than that.
> 
> Ah. I leaped to the wrong conclusion there then.
> 
>> How many resources do you have? You can also increase max-children (an
>> lrmd parameter), which is the number of operations that lrmd is allowed to
>> run concurrently (lrmadmin -p max-children n; by default it's set to 4).
> 
> crm status says:
> 
> Last updated: Thu Oct  4 19:42:30 2012
> Last change: Thu Oct  4 08:21:47 2012
> Stack: openais
> Current DC: <xxx> - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 5 Nodes configured, 5 expected votes
> 81 Resources configured.
> 
> I'm not sure how many is considered a lot, but I can't think that 81 would rate very highly.

Yes, that is a lot for Pacemaker. With 5 nodes, each node has a status
section that includes operation results for every resource, so the cib
can get large quite fast.
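
To put a rough number on that, here is a back-of-the-envelope sketch in
Python. The node and resource counts come from the crm status output
quoted above; the number of operation results kept per resource is an
assumption made purely for illustration, and the helper name
status_entries is hypothetical.

```python
# Rough count of operation results held in the cib status section.
# ops_per_resource is an assumed value, not something taken from the thread.

def status_entries(nodes: int, resources: int, ops_per_resource: int) -> int:
    """Entries if every node records ops_per_resource results per resource."""
    return nodes * resources * ops_per_resource

nodes, resources = 5, 81  # from the quoted crm status output

for ops in (1, 3):
    total = status_entries(nodes, resources, ops)
    print(f"{ops} recorded op(s) per resource: {total} entries")
```

Even with a single recorded result per resource on each node, the count
grows with nodes times resources, which is why the status section
dominates the size of the cib.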

Have you tried setting a higher "batch-limit" in your cluster properties?
Do you see any corosync messages when applying such changes? There is a
good chance you also need to tune corosync timings.
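
For the lrmd warning itself, the following Python sketch models the
max-children limit quoted above as a simple FIFO queue. It is not how
lrmd is implemented. It assumes, hypothetically, that 81 operations are
queued on one node at the same moment and that each takes one second;
real operations are spread over the nodes and vary in duration. The
function name queue_waits is made up for this example.

```python
import heapq

def queue_waits(num_ops, max_children, op_duration):
    """Return how long each operation waits before it starts, given that
    all operations are queued at t=0 and at most max_children run at once."""
    free_at = [0.0] * max_children  # time at which each slot becomes free
    heapq.heapify(free_at)
    waits = []
    for _ in range(num_ops):
        start = heapq.heappop(free_at)  # earliest free slot takes the next op
        waits.append(start)             # queued at t=0, so wait == start time
        heapq.heappush(free_at, start + op_duration)
    return waits

WARN_AFTER = 10.0  # seconds; the ad-hoc lrmd threshold mentioned in the thread

for max_children in (4, 8, 16):
    waits = queue_waits(num_ops=81, max_children=max_children, op_duration=1.0)
    late = sum(w > WARN_AFTER for w in waits)
    print(f"max-children={max_children}: longest wait {max(waits):.0f}s, "
          f"{late} ops waited more than {WARN_AFTER:.0f}s")
```

Under these assumptions the default of 4 is enough to push some
operations past the 10 second mark without any operation timeout being
involved, which matches the explanation quoted above. Raising the limit
with lrmadmin -p max-children n shortens the queue, at the cost of more
operations running at the same time.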

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now
> 
>>> Versions of software are all from Debian Wheezy:
>>> corosync 1.4.2-3
>>> pacemaker 1.1.7-1
>>
>> I'd suggest opening a bugzilla entry and including an hb_report (or
>> crm_report, whichever your distribution ships).
>>
> 
> I plan on doing a bit more testing on the weekend to see if I can find a bit more information about exactly what is going on, otherwise any bug report is going to be a bit vague.
> 
> Thanks
> 
> James
> 
> 
