[Pacemaker] crmd was aborted at pacemaker 1.1.11

David Vossel dvossel at redhat.com
Mon Mar 17 19:03:41 EDT 2014

----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Monday, March 17, 2014 4:51:11 AM
> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> 
> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.inoue3 at gmail.com>:
> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvossel at redhat.com>:
> >>
> >>
> >> ----- Original Message -----
> >>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> >>> To: "pm" <pacemaker at oss.clusterlabs.org>
> >>> Sent: Friday, March 14, 2014 5:52:38 AM
> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> >>>
> >>> Hi,
> >>>
> >>> When I run crm_resource with the node name given in UPPER case,
> >>> crmd aborts.
> >>> (The real node name is in LOWER case.)
> >>
> >> https://github.com/ClusterLabs/pacemaker/pull/462
> >>
> >> does that fix it?
> >>
> >
> > No, it does not. glib seems to behave strangely here.
> > I tested this branch:
> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
> > * glib2-2.22.5-7.el6.x86_64
> >
> > strcase_equal() is not called from g_hash_table_lookup().
> >
> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
> > ...snip...
> > (gdb) b lrm.c:1232
> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
> > (gdb) b strcase_equal
> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
> > (gdb) c
> > Continuing.
> >
> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
> > msg_data=0x7fff8d679540) at lrm.c:1232
> > 1232        lrm_state = lrm_state_find(target_node);
> > (gdb) s
> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
> > 267     {
> > (gdb) n
> > 268         if (!node_name) {
> > (gdb) n
> > 271         return g_hash_table_lookup(lrm_state_table, node_name);
> > (gdb) p g_hash_table_size(lrm_state_table)
> > $1 = 1
> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
> > $2 = 0x1c791a0 "x3650h"
> > (gdb) p node_name
> > $3 = 0x1d4c650 "X3650H"
> > (gdb) n
> > 272     }
> > (gdb) n
> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
> > at lrm.c:1234
> > 1234        if (lrm_state == NULL && is_remote_node) {
> > (gdb) n
> > 1240        CRM_ASSERT(lrm_state != NULL);
> > (gdb) n
> >
> > Program received signal SIGABRT, Aborted.
> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
> > (gdb)
> >
> >
> > I am not sure why yet, so I will continue investigating.
> >
> >
> 
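> I read the code of g_hash_table_lookup().
> It compares the hash values generated by crm_str_hash first, and only
> calls strcase_equal() when the hashes match. Because crm_str_hash is
> case-sensitive, "X3650H" and "x3650h" hash differently, so
> strcase_equal() is never reached.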

Good catch. I've updated the patch in this pull request. Can you give it a go?

https://github.com/ClusterLabs/pacemaker/pull/462
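
For anyone following the thread, the mismatch described above can be sketched in plain C. This is only an illustration under stated assumptions, not the code in the pull request: the names hash_case_sensitive, hash_case_folded and demo_node_name_hashes are made up for this example, and the hash is simply the h * 31 + c form that appears in the quoted quick-fix patch. It shows that the two spellings of the node name get different hash values unless the hash function folds case, which is why a case-insensitive equality callback alone is not enough.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

/* Hypothetical stand-in for a case-sensitive string hash:
 * h = h * 31 + c for every byte of the string. */
static uint32_t hash_case_sensitive(const char *s)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        h = (h << 5) - h + *p;
    return h;
}

/* Same hash, but each byte is lower-cased first, so strings that
 * differ only in ASCII case end up with the same hash value. */
static uint32_t hash_case_folded(const char *s)
{
    uint32_t h = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++)
        h = (h << 5) - h + (uint32_t)tolower(*p);
    return h;
}

/* Usage example with the node names from this thread. */
static void demo_node_name_hashes(void)
{
    /* Case-sensitive hash: the lookup key and the stored key differ,
     * so a hash table never gets as far as the equality callback. */
    assert(hash_case_sensitive("X3650H") != hash_case_sensitive("x3650h"));

    /* Case-folded hash: both spellings agree, so a case-insensitive
     * equality callback can then confirm the match. */
    assert(hash_case_folded("X3650H") == hash_case_folded("x3650h"));
}
```

The general rule is that the hash function and the equality function handed to a hash table have to agree: any two keys the equality function treats as equal must also hash to the same value.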


> 
> *** This is a quick-fix solution. ***
> 
>  crmd/lrm_state.c   |    4 ++--
>  include/crm/crm.h  |    2 ++
>  lib/common/utils.c |   11 +++++++++++
>  3 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/crmd/lrm_state.c b/crmd/lrm_state.c
> index d20d74a..ae036fd 100644
> --- a/crmd/lrm_state.c
> +++ b/crmd/lrm_state.c
> @@ -234,13 +234,13 @@ lrm_state_init_local(void)
>      }
> 
>      lrm_state_table =
> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, internal_lrm_state_destroy);
> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, internal_lrm_state_destroy);
>      if (!lrm_state_table) {
>          return FALSE;
>      }
> 
>      proxy_table =
> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, remote_proxy_free);
> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, remote_proxy_free);
>      if (!proxy_table) {
>           g_hash_table_destroy(lrm_state_table);
>          return FALSE;
> diff --git a/include/crm/crm.h b/include/crm/crm.h
> index b763cc0..46fe5df 100644
> --- a/include/crm/crm.h
> +++ b/include/crm/crm.h
> @@ -195,7 +195,9 @@ typedef GList *GListPtr;
>  #  include <crm/error.h>
> 
>  #  define crm_str_hash g_str_hash_traditional
> +#  define crm_str_hash2 g_str_hash_traditional2
> 
>  guint g_str_hash_traditional(gconstpointer v);
> +guint g_str_hash_traditional2(gconstpointer v);
> 
>  #endif
> diff --git a/lib/common/utils.c b/lib/common/utils.c
> index 29d7965..50fa6c0 100644
> --- a/lib/common/utils.c
> +++ b/lib/common/utils.c
> @@ -2368,6 +2368,17 @@ g_str_hash_traditional(gconstpointer v)
> 
>      return h;
>  }
> +guint
> +g_str_hash_traditional2(gconstpointer v)
> +{
> +    const signed char *p;
> +    guint32 h = 0;
> +
> +    for (p = v; *p != '\0'; p++)
> +        h = (h << 5) - h + g_ascii_tolower(*p);
> +
> +    return h;
> +}
> 
>  void *
>  find_library_function(void **handle, const char *lib, const char *fn, gboolean fatal)
> 
> 
> >>> # crm_resource -C -r p1 -N X3650H
> >>> Cleaning up p1 on X3650H
> >>> Waiting for 1 replies from the CRMdNo messages received in 60 seconds..
> >>> aborting
> >>>
> >>> Mar 14 18:33:10 x3650h crmd[10718]:    error: crm_abort:
> >>> do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state !=
> >>> NULL
> >>> ...snip...
> >>> Mar 14 18:33:10 x3650h pacemakerd[10708]:    error: child_waitpid:
> >>> Managed process 10718 (crmd) dumped core
> >>>
> >>>
> >>> * The state before performing crm_resource.
> >>> ----
> >>> Stack: corosync
> >>> Current DC: x3650g (3232261383) - partition with quorum
> >>> Version: 1.1.10-38c5972
> >>> 2 Nodes configured
> >>> 3 Resources configured
> >>>
> >>>
> >>> Online: [ x3650g x3650h ]
> >>>
> >>> Full list of resources:
> >>>
> >>> f-g     (stonith:external/ibmrsa-telnet):       Started x3650h
> >>> f-h     (stonith:external/ibmrsa-telnet):       Started x3650g
> >>> p1      (ocf::pacemaker:Dummy): Stopped
> >>>
> >>> Migration summary:
> >>> * Node x3650g:
> >>> * Node x3650h:
> >>>    p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14
> >>> 18:32:48 2014'
> >>>
> >>> Failed actions:
> >>>     p1_monitor_10000 on x3650h 'not running' (7): call=16,
> >>> status=complete, last-rc-change='Fri Mar 14 18:32:48 2014',
> >>> queued=0ms, exec=0ms
> >>> ----
> >>>
> >>> Just for reference, the same problem does not occur with crm_standby:
> >>> $ crm_standby -U X3650H -v on
> >>>
> >>>
> >>> Best Regards,
> >>> Kazunori INOUE
> >>>
> >>> _______________________________________________
> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>
> >>> Project Home: http://www.clusterlabs.org
> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>> Bugs: http://bugs.clusterlabs.org
> >>>
> >>
> 
