[Pacemaker] [Problem] The attrd does not sometimes stop.
Lars Ellenberg
lars.ellenberg at linbit.com
Mon Jan 16 08:30:08 UTC 2012
On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
> > Now we proceed to the next mainloop poll:
> >
> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
> >
> > Note the -1 (infinity timeout!)
> >
> > So even though the trigger was (presumably) set,
> > and the ->prepare() should have returned true,
> > the mainloop waits forever for "something" to happen on those file descriptors.
> >
> >
> > I suggest this:
> >
> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
> >
> > Also think about this race: crm_trigger_prepare was already
> > called, only then the signal came in...
> >
> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> > index 2e8b1d0..fd17b87 100644
> > --- a/lib/common/mainloop.c
> > +++ b/lib/common/mainloop.c
> > @@ -33,6 +33,13 @@ static gboolean
> > crm_trigger_prepare(GSource * source, gint * timeout)
> > {
> > crm_trigger_t *trig = (crm_trigger_t *) source;
> > + /* Do not delay signal processing by the mainloop poll stage */
> > + if (trig->trigger)
> > + *timeout = 0;
> > + /* To avoid races between signal delivery and the mainloop poll stage,
> > + * make sure we always have a finite timeout. Unit: milliseconds. */
> > + else
> > + *timeout = 5000; /* arbitrary */
> >
> > return trig->trigger;
> > }
> >
> >
> > This scenario does not let the blocked IPC off the hook, though.
> > That is still possible, both for blocking send and blocking receive,
> > so that should probably be fixed as well, somehow.
> > I'm not sure how likely this "stuck in blocking IPC" is, though.
>
> Interesting, are you sure you're in the right function though?
> trigger and signal events don't have a file descriptor... wouldn't
> these polls be for the IPC related sources and wouldn't they be
> setting their own timeout?
http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs
iiuc, mainloop does something similar to (oversimplified):
timeout = -1; /* infinity */
for s in all GSource
tmp_timeout = -1;
s->prepare(s, &tmp_timeout)
if (tmp_timeout >= 0 && tmp_timeout < timeout)
timeout = tmp_timeout;
poll(GSource fd set, n, timeout);
for s in all GSource
if s->check(s)
s->dispatch(s, ...)
And at some stage it also orders by priority, of course.
Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().
BTW, the mentioned race between signal delivery and mainloop already
doing the poll stage could potentially be solved by using
cl_signal_set_interrupt(SIGTERM, 1),
which would mean we can condense the prepare to
if (trig->trigger)
*timeout = 0;
return trig->trigger;
Glue (and heartbeat) code base is not that, let's say, involved,
because someone had been paranoid.
But because someone had been paranoid for a reason ;-)
Cheers,
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
More information about the Pacemaker
mailing list