[Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
David Vossel
dvossel at redhat.com
Fri Jan 10 15:23:20 UTC 2014
----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Friday, January 10, 2014 5:23:04 AM
> Subject: Re: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
>
> 2014/1/9 Andrew Beekhof <andrew at beekhof.net>:
> >
> > On 8 Jan 2014, at 9:15 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com>
> > wrote:
> >
> >> 2014/1/8 Andrew Beekhof <andrew at beekhof.net>:
> >>>
> >>> On 18 Dec 2013, at 9:50 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com>
> >>> wrote:
> >>>
> >>>> Hi David,
> >>>>
> >>>> 2013/12/18 David Vossel <dvossel at redhat.com>:
> >>>>>
> >>>>> That's a really weird one... I don't see how it is possible for op->id
> >>>>> to be NULL there. You might need to give valgrind a shot to detect
> >>>>> whatever is really going on here.
> >>>>>
> >>>>> -- Vossel
> >>>>>
> >>>> Thank you for the advice. I will try it.
> >>>
> >>> Any update on this?
> >>>
> >>
> >> We are still investigating the cause. It was not reproduced when I ran
> >> under valgrind.
> >> It was also reproduced in RC3.
> >
> > So it happened with RC3 - valgrind, but not with RC3 + valgrind?
> > That's concerning.
> >
> > Nothing in the valgrind output?
> >
>
> The cause was found.
>
> 230 gboolean
> 231 operation_finalize(svc_action_t * op)
> 232 {
> 233     int recurring = 0;
> 234
> 235     if (op->interval) {
> 236         if (op->cancel) {
> 237             op->status = PCMK_LRM_OP_CANCELLED;
> 238             cancel_recurring_action(op);
> 239         } else {
> 240             recurring = 1;
> 241             op->opaque->repeat_timer = g_timeout_add(op->interval,
> 242                                                      recurring_action_timer, (void *)op);
> 243         }
> 244     }
> 245
> 246     if (op->opaque->callback) {
> 247         op->opaque->callback(op);
> 248     }
> 249
> 250     op->pid = 0;
> 251
> 252     if (!recurring) {
> 253         /*
> 254          * If this is a recurring action, do not free explicitly.
> 255          * It will get freed whenever the action gets cancelled.
> 256          */
> 257         services_action_free(op);
> 258         return TRUE;
> 259     }
> 260     return FALSE;
> 261 }
>
> When op->id is not 0, the cancel_recurring_action function (l.238) does
> not remove op from the hash table.
> However, op is then freed by the services_action_free function (l.257).
> That is, the freed data remains in the hash table, and a later
> g_hash_table_lookup call may return the freed data.
>
> Therefore, for the case where g_hash_table_replace has been called (in
> the services_action_async function), I added a change to make sure that
> g_hash_table_remove is always called.
> So far, the segfault has not recurred.
Awesome, thanks for tracking this down. I created a modified version of your patch and put it up for review as a pacemaker pull request.
https://github.com/ClusterLabs/pacemaker/pull/408
-- Vossel
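
For readers following the analysis quoted above, the failure mode (an entry freed while a lookup table still points at it) and the general shape of the remedy can be sketched in plain C. This is only an illustration under stated assumptions: it does not use GLib or any Pacemaker code, and `action_t`, `table_put`, `table_lookup`, `table_remove`, `action_new` and `action_free` are hypothetical names invented for the sketch, not the functions changed in the pull request.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 8

/* Hypothetical stand-in for svc_action_t; not the Pacemaker type. */
typedef struct {
    char id[64];
    int interval;
} action_t;

/* Hypothetical stand-in for the table of recurring actions. */
action_t *table[TABLE_SIZE];

/* Return the registered action with this id, or NULL if none. */
action_t *table_lookup(const char *id)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i] != NULL && strcmp(table[i]->id, id) == 0) {
            return table[i];
        }
    }
    return NULL;
}

/* Forget the entry with this id (does not free it). Returns 1 if found. */
int table_remove(const char *id)
{
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i] != NULL && strcmp(table[i]->id, id) == 0) {
            table[i] = NULL;
            return 1;
        }
    }
    return 0;
}

/* Register op, dropping any older entry with the same id. Returns 1 on success. */
int table_put(action_t *op)
{
    table_remove(op->id);
    for (size_t i = 0; i < TABLE_SIZE; i++) {
        if (table[i] == NULL) {
            table[i] = op;
            return 1;
        }
    }
    return 0; /* table full */
}

/* Allocate an action; the caller owns it until action_free(). */
action_t *action_new(const char *id, int interval)
{
    assert(id != NULL);
    action_t *op = calloc(1, sizeof(*op));
    if (op != NULL) {
        snprintf(op->id, sizeof(op->id), "%s", id);
        op->interval = interval;
    }
    return op;
}

/* Free an action, first removing the table's pointer to it so that
 * no later lookup can hand back freed memory. */
void action_free(action_t *op)
{
    if (op == NULL) {
        return;
    }
    if (table_lookup(op->id) == op) {
        table_remove(op->id);
    }
    free(op);
}
```

With this arrangement, a caller that registers an action with `table_put` and later calls `action_free` on it gets NULL from a subsequent `table_lookup` rather than a dangling pointer. Omitting the removal step inside `action_free` would reproduce the kind of stale entry described in the quoted message.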