[Pacemaker] Bug? Resources running with realtime priority - possibly causing monitor timeouts
Joschi Brauchle
joschi.brauchle at tum.de
Tue Oct 1 09:07:35 UTC 2013
Hello everyone,
on two (recently upgraded) SLES11SP3 machines, we are running an
active/passive NFS fileserver and several other high availability
services using corosync + pacemaker (see version numbers below).
We are having severe problems with resource monitors timing out during
our nightly system backup, when the active machine is under high I/O
load. These problems did not exist under SLES11SP1, from which we
upgraded a few days ago.
After some diagnosis, it turns out that all cluster resources started
by pacemaker, including our backup service, are running with realtime
priority. This does not seem to be correct!
See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
------------
RR 1 41 corosync
RR 1 41 \_ cib
RR 1 41 \_ stonithd
RR 1 41 \_ lrmd
RR 1 41 \_ attrd
RR 1 41 \_ pengine
RR 1 41 \_ crmd
RR 1 41 \_ mgmtd
RR 1 41 krb5kdc
RR 1 41 slapd
RR 1 41 cupsd
RR 1 41 rpc.svcgssd
RR 1 41 rpc.gssd
RR 1 41 rpc.idmapd
RR 1 41 rpc.mountd
RR 1 41 rpc.statd
RR 1 41 rpc.rquotad
RR 1 41 httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 \_ httpd2-prefork
RR 1 41 dsmcad
------------
Clearly, corosync itself **plus all cluster services** (like cups,
slapd, httpd2) are running with realtime priority (process class being
"RR").
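To make the check repeatable, the listing above can be filtered for realtime entries. The following is only a minimal sketch: it assumes the four-column layout printed by "ps -Ao cls,rtprio,pri,comm" (without the header line), the function name realtime_processes is made up for illustration, and the "TS" line in the sample is a hypothetical non-realtime process added for contrast.

```python
def realtime_processes(ps_output):
    """Return (cls, rtprio, comm) for every line whose scheduling class
    is realtime: RR (SCHED_RR) or FF (SCHED_FIFO)."""
    found = []
    for line in ps_output.splitlines():
        # Columns: cls, rtprio, pri, comm (comm may contain spaces)
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        cls, rtprio, _pri, comm = fields
        if cls in ("RR", "FF"):
            # Drop the "\_" tree decoration that --forest adds to children
            found.append((cls, rtprio, comm.replace("\\_", "").strip()))
    return found


if __name__ == "__main__":
    # Three lines taken from the listing above, plus one hypothetical
    # timesharing (TS) process that should be filtered out.
    sample = r"""
RR 1 41 corosync
RR 1 41 \_ lrmd
TS - 19 sshd
RR 1 41 dsmcad
"""
    for cls, rtprio, comm in realtime_processes(sample):
        print(cls, rtprio, comm)
```

Run against live output, anything besides corosync and its own daemons showing up here would be a resource that has picked up the realtime class.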
As far as we remember from SLES11SP1, the resources were not running
with realtime priority there. Hence, this looks like a bug in the more
recent pacemaker/corosync version?
We suspect that the backup software "dsmcad" running in realtime
priority causes the monitors to time out, as the system is under heavy
IO load and may not respond in time for the monitors.
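One possible explanation, which we have not confirmed for pacemaker itself: on Linux a child process inherits the scheduling policy of its parent unless the parent has set SCHED_RESET_ON_FORK, so anything started from a realtime daemon would stay realtime unless it is reset explicitly. A minimal Linux-only sketch of that inheritance (the helper name child_policy is made up for illustration):

```python
import os
import subprocess
import sys


def child_policy():
    """Start a child Python process and return the numeric scheduling
    policy it reports for itself (Linux only)."""
    out = subprocess.check_output(
        [sys.executable, "-c", "import os; print(os.sched_getscheduler(0))"]
    )
    return int(out)


if __name__ == "__main__":
    parent = os.sched_getscheduler(0)  # 0 means "the calling process"
    child = child_policy()
    print("parent policy:", parent)
    print("child policy: ", child)
    print("SCHED_RR is:  ", os.SCHED_RR)
```

Started from a shell with the default policy, both values should match SCHED_OTHER; started from a process running SCHED_RR, the child would be expected to report SCHED_RR as well.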
More details about our setup:
------------
# hb_report -V
cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
# zypper -q if cluster-glue pacemaker corosync
Information for package cluster-glue:
Repository: SLE11-HAE-SP3-Pool
Name: cluster-glue
Version: 1.0.11-0.15.28
Arch: x86_64
...
Information for package pacemaker:
Repository: SLE11-HAE-SP3-Pool
Name: pacemaker
Version: 1.1.9-0.19.102
Arch: x86_64
...
Information for package corosync:
Repository: SLE11-HAE-SP3-Pool
Name: corosync
Version: 1.4.5-0.18.15
Arch: x86_64
------------
I can provide more information on request. We would be glad for
any hints or suggestions on how to fix this problem.
Best regards,
J Brauchle