[Pacemaker] thread safety problem with pacemaker and corosync integration
sdake at redhat.com
Wed Feb 3 14:48:26 EST 2010
For some time, people have reported segfaults on startup when running
pacemaker as a corosync plugin, with tzset showing up in the stack trace.
I believed we had fixed this in corosync 1.2.0 by removing the
thread-unsafe localtime and strftime calls from the code base.
Further investigation by H.J. Lee has mostly identified the problem:
localtime_r calls tzset, which calls getenv(). If another thread calls
setenv() at about the same time, the getenv() in the first thread can
segfault. In glibc, syslog() also calls localtime_r. On rare occasions
Pacemaker calls setenv() while corosync is executing a syslog
operation, resulting in a segfault.
POSIX is clear on this issue: tzset, localtime_r, and syslog should all
be thread safe. Unfortunately, some C library implementations of these
functions are not thread safe when used concurrently with setenv,
because they call getenv internally (which POSIX does not require to be
thread safe).
Our short-term plan is to work around these problems in glibc by:
1) providing a getenv/setenv api inside coroapi.h so that corosync
internal code and third party plugins such as pacemaker can use a mutex
2) porting our syslog-direct-communication code from whitetank so that
we avoid the syslog C library api (which, again, calls localtime_r)
3) implementing a localtime_r replacement which does not call tzset on
each execution, so that the timestamp:on operational mode does not
suffer from the same problem
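
To illustrate item 1, here is a minimal sketch of what mutex-protected
environment accessors could look like. The names locked_getenv and
locked_setenv and their signatures are hypothetical; this is not the
actual coroapi.h interface, which is not specified in this message. The
getter copies the value out while holding the lock, since the pointer
returned by getenv() may be invalidated by a later setenv().

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names: the real coroapi.h interface is not given here. */
static pthread_mutex_t env_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Copy the value of `name` into `buf` while holding the lock.
 * Returns 0 on success, -1 if the variable is unset or does not fit.
 */
int locked_getenv(const char *name, char *buf, size_t buflen)
{
    int rc = -1;

    pthread_mutex_lock(&env_lock);
    const char *val = getenv(name);
    if (val != NULL && strlen(val) < buflen) {
        strcpy(buf, val);   /* safe: length checked above */
        rc = 0;
    }
    pthread_mutex_unlock(&env_lock);
    return rc;
}

/* Same contract as setenv(), serialized against locked_getenv(). */
int locked_setenv(const char *name, const char *value, int overwrite)
{
    pthread_mutex_lock(&env_lock);
    int rc = setenv(name, value, overwrite);
    pthread_mutex_unlock(&env_lock);
    return rc;
}
```

Note that a wrapper like this only serializes callers that go through
it. It cannot protect getenv() calls made inside the C library itself
(for example from tzset), which is why items 2 and 3 are still needed.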
If you're suffering from this issue, please be aware that we have a
root cause and will get it resolved.