[Pacemaker] [Problem] The attrd does not sometimes stop.

Andrew Beekhof andrew at beekhof.net
Mon Jan 16 05:46:58 UTC 2012


On Sun, Jan 15, 2012 at 1:57 AM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Tue, Jan 10, 2012 at 04:43:51PM +0900, renayama19661014 at ybb.ne.jp wrote:
>> Hi Lars,
>>
>> I have attached the strace output from when the problem recurred at the end of last year.
>> For confirmation, I used a glue build with your patch applied.
>>
>> I captured it by running strace -p against attrd right before stopping Heartbeat.
>>
>> attrd eventually received SIGTERM, but it did not stop.
>> It stopped only later, when I sent SIGKILL.
>
> The strace reveals something interesting:
>
> This poll looks like the mainloop poll,
> but some ->prepare() has modified the timeout to be 0,
> so we proceed directly to ->check() and then ->dispatch().
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=8, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=8, revents=POLLIN|POLLHUP}])
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738632
>> recv(4, 0x95af308, 576, MSG_DONTWAIT)   = -1 EAGAIN (Resource temporarily unavailable)
> ...
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = ? ERESTART_RESTARTBLOCK (To be restarted)
>> --- SIGTERM (Terminated) @ 0 (0) ---
>> sigreturn()                             = ? (mask now [])
>
> Ok. signal received, trigger set.
> Still finishing this mainloop iteration, though.
>
> These recv(),poll() look like invocations of G_CH_prepare_int().
> Does not matter much, though.
>
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
>> recv(7, 0x95b1657, 3513, MSG_DONTWAIT)  = -1 EAGAIN (Resource temporarily unavailable)
>> poll([{fd=7, events=0}], 1, 0)          = 0 (Timeout)
>
>> times({tms_utime=2, tms_stime=3, tms_cutime=0, tms_cstime=0}) = 433738634
>
> Now we proceed to the next mainloop poll:
>
>> poll([{fd=7, events=POLLIN|POLLPRI}, {fd=4, events=POLLIN|POLLPRI}, {fd=5, events=POLLIN|POLLPRI}], 3, -1
>
> Note the -1 (infinite timeout!)
>
> So even though the trigger was (presumably) set,
> and the ->prepare() should have returned true,
> the mainloop waits forever for "something" to happen on those file descriptors.
>
>
> I suggest this:
>
> crm_trigger_prepare should set *timeout = 0 if the trigger is set.
>
> Also consider this race: crm_trigger_prepare has already been
> called, and only then does the signal arrive...
>
> diff --git a/lib/common/mainloop.c b/lib/common/mainloop.c
> index 2e8b1d0..fd17b87 100644
> --- a/lib/common/mainloop.c
> +++ b/lib/common/mainloop.c
> @@ -33,6 +33,13 @@ static gboolean
>  crm_trigger_prepare(GSource * source, gint * timeout)
>  {
>     crm_trigger_t *trig = (crm_trigger_t *) source;
> +    /* Do not delay signal processing by the mainloop poll stage */
> +    if (trig->trigger)
> +           *timeout = 0;
> +    /* To avoid races between signal delivery and the mainloop poll stage,
> +     * make sure we always have a finite timeout. Unit: milliseconds. */
> +    else
> +           *timeout = 5000; /* arbitrary */
>
>     return trig->trigger;
>  }
>
>
> This scenario does not let the blocked IPC off the hook, though.
> That is still possible, both for blocking send and blocking receive,
> so that should probably be fixed as well, somehow.
> I'm not sure how likely this "stuck in blocking IPC" is, though.

Interesting. Are you sure you're in the right function, though?
Trigger and signal events don't have a file descriptor... wouldn't
these polls be for the IPC-related sources, and wouldn't they be
setting their own timeout?
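
To make the question concrete, here is a rough, GLib-free model of how the
poll timeout gets combined from the sources' prepare() callbacks, as I
understand GLib's documented behaviour: a source whose prepare() returns TRUE
forces a timeout of 0, otherwise the smallest non-negative timeout wins, and
-1 is used only if no source asked for a deadline. All names in it
(model_source, poll_timeout, the three prepare functions) are made up for
illustration and are not Pacemaker or GLib APIs; the 5000 ms value is the
arbitrary one from the patch above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a GSource: prepare() reports readiness and may
 * request a poll timeout in milliseconds (-1 means "no deadline"). */
struct model_source {
    bool (*prepare)(void *data, int *timeout_ms);
    void *data;
};

/* Trigger-like source without the patch: no fd, never touches the timeout. */
bool trigger_prepare(void *data, int *timeout_ms)
{
    (void)timeout_ms;
    return *(bool *)data;
}

/* Trigger-like source with the proposed change: always a finite timeout. */
bool patched_trigger_prepare(void *data, int *timeout_ms)
{
    bool set = *(bool *)data;
    *timeout_ms = set ? 0 : 5000; /* 5000 is the arbitrary value from the patch */
    return set;
}

/* IPC-like source with nothing queued: not ready, no deadline of its own. */
bool idle_ipc_prepare(void *data, int *timeout_ms)
{
    (void)data;
    *timeout_ms = -1;
    return false;
}

/* Assumed combination rule: any ready source gives 0, otherwise the smallest
 * non-negative request, otherwise -1 (block until an fd becomes ready). */
int poll_timeout(const struct model_source *src, size_t n)
{
    int result = -1;
    bool any_ready = false;

    assert(src != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int t = -1;
        if (src[i].prepare(src[i].data, &t))
            any_ready = true;
        else if (t >= 0 && (result < 0 || t < result))
            result = t;
    }
    return any_ready ? 0 : result;
}
```

In this model, a trigger that is already set when prepare() runs yields a
timeout of 0 even without touching *timeout. The unpatched case only ends up
in poll(..., -1) if the flag is set after prepare() has already run, which is
the race Lars describes. The patched variant caps that wait at 5000 ms.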

>
> --
> : Lars Ellenberg
> : LINBIT | Your Way to High Availability
> : DRBD/HA support and consulting http://www.linbit.com
>
> DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.