[Pacemaker] lrmd WARN on high IO load

Mon Aug 2 14:27:50 UTC 2010

Hi,

On Mon, Jul 19, 2010 at 07:09:11PM -0300, Diego Woitasen wrote:
> 2010/7/16 Diego Woitasen <diego at woitasen.com.ar>:
> > Hi,
> >  I've installed Heartbeat+Pacemaker (3.0.3 and 1.0.9). I have a
> > resource which executes a script to check the service:
> >
> > primitive kolab_imapd ocf:heartbeat:kolab-service \
> >        params service="all" monitor_script="/usr/local/bin/check-imap.py" \
> >        meta migration-threshold="3" failure-timeout="300s" is-managed="true" \
> >        operations $id="operations-imap" \
> >        op monitor interval="20s" timeout="30s" on-fail="restart" \
> >        op start interval="0" timeout="120" \
> >        op stop interval="0" timeout="120"
> >
> > I ran an I/O stress test using bonnie++ and started to see this message:
> >
> > Jul 16 18:24:38 imapserver lrmd: [4719]: WARN: perform_ra_op: the
> > operation operation monitor[21] on ocf::kolab-service::kolab_imapd for
> > client 4722, its parameters: CRM_meta_interval=[20000]
> > monitor_script=[/usr/local/bin/check-imap.py]
> > CRM_meta_on_fail=[restart] CRM_meta_timeout=[30000]
> > crm_feature_set=[3.0.1] CRM_meta_name=[monitor] service=[all]  stayed
> > in operation list for 32740 ms (longer than 10000 ms)
> >
> > The problem is that I also get these messages under high I/O without
> > the stress test, for example while running backups. If I understand
> > the message correctly, the monitor operation didn't start; it was
> > waiting in some work queue to start.

It was most probably waiting for the previous monitor operation
to finish, though that one should have timed out according to
your configuration. Alternatively, at least 4 operations on
other resources were already running on the node, which would
also keep new operations queued. If you expect high load on the
server, you should tune the timeouts accordingly.
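
To see how far off the current timeouts are, it can help to pull the
numbers out of the warning. The following is only a minimal sketch and
not part of lrmd or Pacemaker: the function name parse_lrmd_delay and
the regular expressions are made up for illustration, and they assume
the message format shown in the log excerpt in this thread.

```python
import re

# Matches the lrmd warning quoted in this thread. \s+ between words tolerates
# the line wrapping added by mail clients; syslog normally keeps it on one line.
DELAY_RE = re.compile(
    r"operation\s+(?P<op>\w+)\[(?P<op_id>\d+)\]\s+on\s+(?P<resource>\S+)"
    r".*?stayed\s+in\s+operation\s+list\s+for\s+(?P<queued_ms>\d+)\s+ms"
    r"\s+\(longer\s+than\s+(?P<threshold_ms>\d+)\s+ms\)",
    re.DOTALL,
)
TIMEOUT_RE = re.compile(r"CRM_meta_timeout=\[(\d+)\]")


def parse_lrmd_delay(text):
    """Return a dict describing the first queue-delay warning in text, or None."""
    match = DELAY_RE.search(text)
    if match is None:
        return None
    info = {
        "op": match.group("op"),
        "resource": match.group("resource"),
        "queued_ms": int(match.group("queued_ms")),
        "threshold_ms": int(match.group("threshold_ms")),
        "timeout_ms": None,  # stays None if the parameter is not in the message
    }
    timeout = TIMEOUT_RE.search(match.group(0))
    if timeout is not None:
        info["timeout_ms"] = int(timeout.group(1))
    return info


# The warning from this thread, wrapped the same way as in the mail.
SAMPLE = """Jul 16 18:24:38 imapserver lrmd: [4719]: WARN: perform_ra_op: the
operation operation monitor[21] on ocf::kolab-service::kolab_imapd for
client 4722, its parameters: CRM_meta_interval=[20000]
monitor_script=[/usr/local/bin/check-imap.py]
CRM_meta_on_fail=[restart] CRM_meta_timeout=[30000]
crm_feature_set=[3.0.1] CRM_meta_name=[monitor] service=[all]  stayed
in operation list for 32740 ms (longer than 10000 ms)"""

if __name__ == "__main__":
    print(parse_lrmd_delay(SAMPLE))
```

For the message above this reports a queue time of 32740 ms against a
configured monitor timeout of 30000 ms, which gives a rough idea of how
much headroom the timeouts would need under this kind of load.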

Thanks,

Dejan

> > If I try to execute a command while I'm running the stress test it's
> > slow (3 seconds approx.) but it works. For example, I can run "crm
> > configure show" and the output appears in 3 or 4 seconds.
> >
> > The server has 2 quad-core processors, 6 GB of RAM, running RHEL 5.
> >
> > Regards,
> >  Diego
> >
> > --
> > Diego Woitasen
> >
> 
> 
> I've raised the priority of the process to 10 and it works now.
> 
> The documentation says that the default rtprio is 5. That's wrong; it's 1.
> At least in my pkgs...
> 
> Regards,
>  Diego
> 
> -- 
> Diego Woitasen