[Pacemaker] [Problem] The attrd does not sometimes stop.

Lars Ellenberg lars.ellenberg at linbit.com
Mon Jan 16 08:30:08 UTC 2012


On Mon, Jan 16, 2012 at 04:46:58PM +1100, Andrew Beekhof wrote:
> > Now we proceed to the next mainloop poll:
> >
> >> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
> >
> > Note the -1 (infinity timeout!)
> >
> > So even though the trigger was (presumably) set,
> > and the ->prepare() should have returned true,
> > the mainloop waits forever for "something" to happen on those file descriptors.
> >
> >
> > I suggest this:
> >
> > crm_trigger_prepare should set *timeout = 0, if trigger is set.
> >
> > Also think about this race: crm_trigger_prepare was already
> > called, only then the signal came in...
> >
> > diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> > index 2e8b1d0..fd17b87 100644
> > --- a/lib/common/mainloop.c
> > +++ b/lib/common/mainloop.c
> > @@ -33,6 +33,13 @@ static gboolean
> >  crm_trigger_prepare(GSource * source, gint * timeout)
> >  {
> >     crm_trigger_t *trig = (crm_trigger_t *) source;
> > +    /* Do not delay signal processing by the mainloop poll stage */
> > +    if (trig->trigger)
> > +           *timeout = 0;
> > +    /* To avoid races between signal delivery and the mainloop poll stage,
> > +     * make sure we always have a finite timeout. Unit: milliseconds. */
> > +    else
> > +           *timeout = 5000; /* arbitrary */
> >
> >     return trig->trigger;
> >  }
> >
> >
> > This scenario does not let the blocked IPC off the hook, though.
> > That is still possible, both for blocking send and blocking receive,
> > so that should probably be fixed as well, somehow.
> > I'm not sure how likely this "stuck in blocking IPC" is, though.
> 
> Interesting, are you sure you're in the right function though?
> trigger and signal events don't have a file descriptor... wouldn't
> these polls be for the IPC related sources and wouldn't they be
> setting their own timeout?

http://developer.gnome.org/glib/2.30/glib-The-Main-Event-Loop.html#GSourceFuncs

iiuc, mainloop does something similar to (oversimplified):
 	timeout = -1; /* infinity */
	for s in all GSource
		tmp_timeout = -1;
		s->prepare(s, &tmp_timeout)
		if (tmp_timeout >= 0 && tmp_timeout < timeout)
			timeout = tmp_timeout;

	poll(GSource fd set, n, timeout);

	for s in all GSource
		if s->check(s)
			s->dispatch(s, ...)

And at some stage it also orders by priority, of course.

Also compare with the comment above /* Sigh... */ in glue G_SIG_prepare().

BTW, the mentioned race between signal delivery and mainloop already
doing the poll stage could potentially be solved by using
cl_signal_set_interrupt(SIGTERM, 1),
which would mean we can condense the prepare to
	if (trig->trigger)
		*timeout = 0;
	return trig->trigger;

Glue (and heartbeat) code base is not that, let's say, involved,
because someone had been paranoid.
But because someone had been paranoid for a reason ;-)

Cheers,

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.




More information about the Pacemaker mailing list