[Pacemaker] Bug? Resources running with realtime priority - possibly causing monitor timeouts

Joschi Brauchle joschi.brauchle at tum.de
Tue Oct 1 09:07:35 UTC 2013


Hello everyone,

on two (recently upgraded) SLES11SP3 machines, we are running an 
active/passive NFS fileserver and several other high availability 
services using corosync + pacemaker (see version numbers below).

We are having severe problems with resource monitors timing out during 
our nightly system backup, when the active machine is under high I/O 
load. These problems did not exist under SLES11SP1, from which we 
upgraded just a few days ago.


After some diagnosis, it turns out that all cluster resources started 
by pacemaker are running with realtime priority, including our backup 
service. This does not seem correct!


See this output of "ps --forest -Ao cls,rtprio,pri,comm --sort cls":
------------
  RR      1  41 corosync
  RR      1  41  \_ cib
  RR      1  41  \_ stonithd
  RR      1  41  \_ lrmd
  RR      1  41  \_ attrd
  RR      1  41  \_ pengine
  RR      1  41  \_ crmd
  RR      1  41  \_ mgmtd
  RR      1  41 krb5kdc
  RR      1  41 slapd
  RR      1  41 cupsd
  RR      1  41 rpc.svcgssd
  RR      1  41 rpc.gssd
  RR      1  41 rpc.idmapd
  RR      1  41 rpc.mountd
  RR      1  41 rpc.statd
  RR      1  41 rpc.rquotad
  RR      1  41 httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41  \_ httpd2-prefork
  RR      1  41 dsmcad
------------
Clearly, corosync itself **plus all cluster services** (like cups, 
slapd, httpd2) are running with realtime priority (scheduling class 
"RR", i.e. SCHED_RR).

As far as we remember, the resources were not running with realtime 
priority under SLES11SP1. Hence, this looks like a bug in the more 
recent pacemaker/corosync versions?!
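
The mechanics at least seem plausible: the SCHED_RR class is inherited 
across fork()/exec(), so anything spawned by lrmd (i.e. every resource 
agent and any daemon it starts) ends up realtime as well. This is easy 
to reproduce on any Linux box as root (the priority 41 merely mirrors 
the value shown above; the PID in the output is of course arbitrary):
------------
# chrt -r 41 bash -c 'sleep 5 & chrt -p $!'
pid 4711's current scheduling policy: SCHED_RR
pid 4711's current scheduling priority: 41
------------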


We suspect that the backup software "dsmcad" running with realtime 
priority is what causes the monitors to time out: under the heavy I/O 
load of the backup, the system may not respond to the monitors in time.
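
A possible stopgap until the root cause is clear might be to demote the 
backup daemon back to the normal scheduler after it has started, for 
example (pgrep is just one way to find the PIDs):
------------
# for p in $(pgrep -x dsmcad); do chrt -o -p 0 $p; done
------------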


More details about our setup:
------------
# hb_report -V
cluster-glue: 1.0.11 (8347e8c9b94f111400dd844f11bc6ede98cc11a5)
# zypper -q if cluster-glue pacemaker corosync
Information for package cluster-glue:

Repository: SLE11-HAE-SP3-Pool
Name: cluster-glue
Version: 1.0.11-0.15.28
Arch: x86_64
...
Information for package pacemaker:

Repository: SLE11-HAE-SP3-Pool
Name: pacemaker
Version: 1.1.9-0.19.102
Arch: x86_64
...
Information for package corosync:

Repository: SLE11-HAE-SP3-Pool
Name: corosync
Version: 1.4.5-0.18.15
Arch: x86_64
------------

I can provide more information on request. We would be grateful for 
any hints or suggestions on how to fix this problem.

Best regards,
J Brauchle
