[Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

Kazunori INOUE kazunori.inoue3 at gmail.com
Fri Jan 10 11:23:04 UTC 2014


2014/1/9 Andrew Beekhof <andrew at beekhof.net>:
>
> On 8 Jan 2014, at 9:15 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com> wrote:
>
>> 2014/1/8 Andrew Beekhof <andrew at beekhof.net>:
>>>
>>> On 18 Dec 2013, at 9:50 pm, Kazunori INOUE <kazunori.inoue3 at gmail.com> wrote:
>>>
>>>> Hi David,
>>>>
>>>> 2013/12/18 David Vossel <dvossel at redhat.com>:
>>>>>
>>>>> That's a really weird one... I don't see how it is possible for op->id to be NULL there.   You might need to give valgrind a shot to detect whatever is really going on here.
>>>>>
>>>>> -- Vossel
>>>>>
>>>> Thank you for advice. I try it.
>>>
>>> Any update on this?
>>>
>>
>> We are still investigating a cause. It was not reproduced when I gave valgrind..
>> And it was reproduced in RC3.
>
> So it happened RC3 - valgrind, but not RC3 + valgrind?
> Thats concerning.
>
> Nothing in the valgrind output?
>

The cause was found.

230 gboolean
231 operation_finalize(svc_action_t * op)
232 {
233     int recurring = 0;
234
235     if (op->interval) {
236         if (op->cancel) {
237             op->status = PCMK_LRM_OP_CANCELLED;
238             cancel_recurring_action(op);
239         } else {
240             recurring = 1;
241             op->opaque->repeat_timer = g_timeout_add(op->interval,
242
recurring_action_timer, (void *)op);
243         }
244     }
245
246     if (op->opaque->callback) {
247         op->opaque->callback(op);
248     }
249
250     op->pid = 0;
251
252     if (!recurring) {
253         /*
254          * If this is a recurring action, do not free explicitly.
255          * It will get freed whenever the action gets cancelled.
256          */
257         services_action_free(op);
258         return TRUE;
259     }
260     return FALSE;
261 }

When op->id is not 0, in cancel_recurring_action function (l.238), op
is not removed from hash table.
However, op is freed in services_action_free function (l.257).  That
is, the freed data remains in hash table.
Then, g_hash_table_lookup function may look up the freed data.

Therefore, when g_hash_table_replace function was called (in
services_action_async function), I added change so that
g_hash_table_remove function might certainly be called.
As of now, segfault has not happened.

diff --git a/lib/services/services.c b/lib/services/services.c
index cb02511..73d0492 100644
--- a/lib/services/services.c
+++ b/lib/services/services.c
@@ -347,9 +347,9 @@ services_action_free(svc_action_t * op)
 }

 gboolean
-cancel_recurring_action(svc_action_t * op)
+cancel_recurring_action(svc_action_t * op, gboolean force_remove)
 {
-    if (op->pid) {
+    if (force_remove == FALSE && op->pid) {
         return FALSE;
     }

@@ -378,7 +378,7 @@ services_action_cancel(const char *name, const
char *action, int interval)
         return FALSE;
     }

-    if (cancel_recurring_action(op)) {
+    if (cancel_recurring_action(op, FALSE)) {
         op->status = PCMK_LRM_OP_CANCELLED;
         if (op->opaque->callback) {
             op->opaque->callback(op);
diff --git a/lib/services/services_linux.c b/lib/services/services_linux.c
index 7060be0..a02f8d9 100644
--- a/lib/services/services_linux.c
+++ b/lib/services/services_linux.c
@@ -235,7 +235,7 @@ operation_finalize(svc_action_t * op)
     if (op->interval) {
         if (op->cancel) {
             op->status = PCMK_LRM_OP_CANCELLED;
-            cancel_recurring_action(op);
+            cancel_recurring_action(op, TRUE);
         } else {
             recurring = 1;
             op->opaque->repeat_timer = g_timeout_add(op->interval,
diff --git a/lib/services/services_private.h b/lib/services/services_private.h
index dd759e3..72dc1ba 100644
--- a/lib/services/services_private.h
+++ b/lib/services/services_private.h
@@ -45,7 +45,7 @@ GList *resources_os_list_ocf_agents(const char *provider);

 GList *resources_os_list_nagios_agents(void);

-gboolean cancel_recurring_action(svc_action_t * op);
+gboolean cancel_recurring_action(svc_action_t * op, gboolean force_remove);

 gboolean recurring_action_timer(gpointer data);
 gboolean operation_finalize(svc_action_t * op);

>>
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>




More information about the Pacemaker mailing list