[Pacemaker] [Problem] The attrd does not sometimes stop.

Lars Ellenberg lars.ellenberg at linbit.com
Sat Jan 14 14:57:12 UTC 2012


On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661014 at ybb.ne.jp wrote:
> Hi Lars,
> 
> I attach strace file when a problem reappeared at the end of last year.
> I used glue which applied your patch for confirmation.
> 
> It is the file which I picked with attrd by strace -p command right before I stop Heartbeat.
> 
> Finally SIGTERM caught it, but attrd did not stop.
> The attrd stopped afterwards when I sent SIGKILL.

The strace reveals something interesting:

This poll looks like the mainloop poll,
but some ->prepare() has modified the timeout to be 0,
so we proceed directly to ->check() and then ->dispatch().
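
For readers less familiar with the prepare/check/dispatch cycle, here is a rough, self-contained model of how a GLib-style mainloop derives the poll() timeout from its ->prepare() callbacks. It is only a sketch: toy_source, toy_prepare and toy_poll_timeout are made-up names, not GLib or Pacemaker code, and priorities, ->check() and ->dispatch() are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GSource; only the prepare() hook is modelled. */
typedef struct toy_source toy_source;
struct toy_source {
    bool (*prepare)(toy_source *src, int *timeout_ms);
    bool ready;            /* e.g. "data pending" or "trigger set" */
};

/* A prepare() that reports readiness and leaves the timeout untouched. */
bool toy_prepare(toy_source *src, int *timeout_ms)
{
    (void) timeout_ms;
    return src->ready;
}

/* Timeout the loop would pass to poll(): -1 blocks forever, 0 returns at once. */
int toy_poll_timeout(toy_source **sources, size_t n)
{
    int timeout_ms = -1;

    for (size_t i = 0; i < n; i++) {
        int src_timeout_ms = -1;

        assert(sources[i] != NULL && sources[i]->prepare != NULL);
        if (sources[i]->prepare(sources[i], &src_timeout_ms))
            src_timeout_ms = 0;    /* a ready source must not wait in poll() */

        /* Keep the smallest non-negative timeout requested so far. */
        if (src_timeout_ms >= 0 &&
            (timeout_ms < 0 || src_timeout_ms < timeout_ms))
            timeout_ms = src_timeout_ms;
    }
    return timeout_ms;
}
```

In this model a single ready source is enough to turn the poll into a non-blocking one, which matches the timeout of 0 in the strace line below. If no source is ready and none asks for a timeout, the result stays at -1.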

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
> recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily unavailable)
...
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be restarted)
> --- SIGTERM (Terminated) @ 0 (0) ---
> sigreturn()                             = ? (mask now [])

OK, signal received, trigger set.
We are still finishing this mainloop iteration, though.

These recv()/poll() pairs look like invocations of G_CH_prepare_int().
That does not matter much, though.

> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)

> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634

Now we proceed to the next mainloop poll:

> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1

Note the -1 (infinite timeout!)

So even though the trigger was (presumably) set,
and the ->prepare() should have returned true,
the mainloop waits forever for "something" to happen on those file descriptors.


I suggest this:

crm_trigger_prepare() should set *timeout = 0 if the trigger is set.

Also think about this race: crm_trigger_prepare() has already been
called, and only then does the signal come in. The trigger is then set
too late to influence the timeout of the poll that follows.

diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
index 2e8b1d0..fd17b87 100644
--- a/lib/common/mainloop.c
+++ b/lib/common/mainloop.c
@@ -33,6 +33,13 @@ static gboolean
 crm_trigger_prepare(GSource * source, gint * timeout)
 {
     crm_trigger_t *trig = (crm_trigger_t *) source;
+    /* Do not delay signal processing by the mainloop poll stage */
+    if (trig->trigger)
+	    *timeout = 0;
+    /* To avoid races between signal delivery and the mainloop poll stage,
+     * make sure we always have a finite timeout. Unit: milliseconds. */
+    else
+	    *timeout = 5000; /* arbitrary */
 
     return trig->trigger;
 }


This scenario does not let blocking IPC off the hook, though.
Getting stuck there is still possible, for both a blocking send and
a blocking receive, so that should probably be fixed as well, somehow.
I'm not sure how likely this "stuck in blocking IPC" case is, though.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.



