[Pacemaker] Bug? Resources running with realtime priority - possibly causing monitor timeouts
Dejan Muhamedagic
dejanmm at fastmail.fm
Tue Oct 1 17:22:12 UTC 2013
Hi,
On Tue, Oct 01, 2013 at 11:07:35AM +0200, Joschi Brauchle wrote:
> Hello everyone,
>
> on two (recently upgraded) SLES11SP3 machines, we are running an
> active/passive NFS fileserver and several other high availability
> services using corosync + pacemaker (see version numbers below).
>
> We are having severe problems with resource monitors timing out
> during our system backup at night, where the active machine is under
> high IO load. These problems did not exist under SLES11SP1, from
> which we just upgraded some days ago.
>
>
> After some diagnosis, it turns out that actually all cluster
> resources which are started by pacemaker are running with realtime
> priority, which includes our backup service. This seems not to be
> correct!
>
>
> See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
> ------------
> RR 1 41 corosync
> RR 1 41 \_ cib
> RR 1 41 \_ stonithd
> RR 1 41 \_ lrmd
> RR 1 41 \_ attrd
> RR 1 41 \_ pengine
> RR 1 41 \_ crmd
> RR 1 41 \_ mgmtd
> RR 1 41 krb5kdc
> RR 1 41 slapd
> RR 1 41 cupsd
> RR 1 41 rpc.svcgssd
> RR 1 41 rpc.gssd
> RR 1 41 rpc.idmapd
> RR 1 41 rpc.mountd
> RR 1 41 rpc.statd
> RR 1 41 rpc.rquotad
> RR 1 41 httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 \_ httpd2-prefork
> RR 1 41 dsmcad
> ------------
> Clearly, corosync itself **plus all cluster services** (like cups,
> slapd, httpd2) are running with realtime priority (process class
> being "RR").
Oops. Looks like neither corosync nor lrmd resets the scheduling
policy and priority for its children, so they all inherit SCHED_RR.
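For illustration only, here is a minimal sketch of what "resetting the scheduler for the children" means. It is written in Python rather than the C used by corosync and lrmd, and it is not the actual fix: the helper names (reset_to_default_scheduling, spawn_with_default_scheduling, child_policy) are made up for this example. The idea is that the child switches back to SCHED_OTHER between fork() and exec(), so whatever it starts no longer inherits the parent's realtime class.

```python
import os
import subprocess
import sys


def reset_to_default_scheduling():
    """Runs in the child after fork() and before exec().

    Switches the calling process to the default time-sharing policy.
    SCHED_OTHER requires a static priority of 0.
    """
    os.sched_setscheduler(0, os.SCHED_OTHER, os.sched_param(0))


def spawn_with_default_scheduling(argv):
    """Hypothetical helper: start argv so that it does not inherit
    a realtime policy from the (possibly SCHED_RR) parent."""
    return subprocess.run(
        argv,
        preexec_fn=reset_to_default_scheduling,  # executed in the child only
        capture_output=True,
        text=True,
        check=True,
    )


def child_policy():
    """Spawn a probe child and return the scheduling policy it sees."""
    probe = [sys.executable, "-c", "import os; print(os.sched_getscheduler(0))"]
    return int(spawn_with_default_scheduling(probe).stdout)
```

A daemon that has to stay in the realtime class itself would apply such a reset only on the child side, as above, and leave its own policy untouched.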
> As far as we remember from SLES11SP1, the resources were not running
> in realtime priority there. Hence, this looks like a bug in the more
> recent pacemaker/corosync version?!?
Looks like it. Can you please open a support call?
Thanks,
Dejan
> We suspect that the backup software "dsmcad" running in realtime
> priority causes the monitors to time out, as the system is under
> heavy IO load and may not respond in time for the monitors.
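To double-check which processes (dsmcad included) currently carry a realtime policy, without parsing ps output, something along these lines should work on Linux. It is a rough sketch under stated assumptions: realtime_processes is a name invented here, it reads /proc directly, and it only inspects each process's main thread.

```python
import os

# Short class names as printed in the "cls" column of ps.
REALTIME = {os.SCHED_FIFO: "FF", os.SCHED_RR: "RR"}

# The kernel may OR this flag into the value returned by sched_getscheduler().
RESET_ON_FORK = getattr(os, "SCHED_RESET_ON_FORK", 0x40000000)


def realtime_processes():
    """Return (cls, rtprio, pid, comm) for every process whose main
    thread runs under SCHED_FIFO or SCHED_RR."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        pid = int(entry)
        try:
            policy = os.sched_getscheduler(pid) & ~RESET_ON_FORK
            if policy not in REALTIME:
                continue
            rtprio = os.sched_getparam(pid).sched_priority
            with open(f"/proc/{pid}/comm") as f:
                comm = f.read().strip()
        except OSError:
            continue  # process exited or is not accessible
        found.append((REALTIME[policy], rtprio, pid, comm))
    return sorted(found, key=lambda row: row[2])
```

On the node described above, the returned list would be expected to contain the same commands as the ps listing, each with class "RR".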
>
>
> More details about our setup:
> ------------
> # hb_report -V
> cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
> # zypper -q if cluster-glue pacemaker corosync
> Information for package cluster-glue:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: cluster-glue
> Version: 1.0.11-0.15.28
> Arch: x86_64
> ...
> Information for package pacemaker:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: pacemaker
> Version: 1.1.9-0.19.102
> Arch: x86_64
> ...
> Information for package corosync:
>
> Repository: SLE11-HAE-SP3-Pool
> Name: corosync
> Version: 1.4.5-0.18.15
> Arch: x86_64
> ------------
>
> I can provide any further information on request. We would be glad
> for any hints or suggestions on how to fix this problem.
>
> Best regards,
> J Brauchle