[Pacemaker] crmd was aborted at pacemaker 1.1.11

Tue Mar 18 11:55:10 EDT 2014

----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, March 18, 2014 12:30:01 AM
> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> 
> 2014-03-18 8:03 GMT+09:00 David Vossel <dvossel at redhat.com>:
> >
> > ----- Original Message -----
> >> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> >> To: "The Pacemaker cluster resource manager"
> >> <pacemaker at oss.clusterlabs.org>
> >> Sent: Monday, March 17, 2014 4:51:11 AM
> >> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> >>
> >> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.inoue3 at gmail.com>:
> >> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvossel at redhat.com>:
> >> >>
> >> >>
> >> >> ----- Original Message -----
> >> >>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> >> >>> To: "pm" <pacemaker at oss.clusterlabs.org>
> >> >>> Sent: Friday, March 14, 2014 5:52:38 AM
> >> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
> >> >>>
> >> >>> Hi,
> >> >>>
> >> >>> When I run crm_resource with the node name given in UPPER case,
> >> >>> crmd aborts.
> >> >>> (The real node name is in lower case.)
> >> >>
> >> >> https://github.com/ClusterLabs/pacemaker/pull/462
> >> >>
> >> >> does that fix it?
> >> >>
> >> >
> >> > No, it does not. glib behaves in an unexpected way here.
> >> > I tested this branch:
> >> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
> >> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
> >> > * glib2-2.22.5-7.el6.x86_64
> >> >
> >> > strcase_equal() is not called from g_hash_table_lookup().
> >> >
> >> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
> >> > ...snip...
> >> > (gdb) b lrm.c:1232
> >> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
> >> > (gdb) b strcase_equal
> >> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
> >> > (gdb) c
> >> > Continuing.
> >> >
> >> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
> >> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
> >> > msg_data=0x7fff8d679540) at lrm.c:1232
> >> > 1232        lrm_state = lrm_state_find(target_node);
> >> > (gdb) s
> >> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
> >> > 267     {
> >> > (gdb) n
> >> > 268         if (!node_name) {
> >> > (gdb) n
> >> > 271         return g_hash_table_lookup(lrm_state_table, node_name);
> >> > (gdb) p g_hash_table_size(lrm_state_table)
> >> > $1 = 1
> >> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
> >> > $2 = 0x1c791a0 "x3650h"
> >> > (gdb) p node_name
> >> > $3 = 0x1d4c650 "X3650H"
> >> > (gdb) n
> >> > 272     }
> >> > (gdb) n
> >> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
> >> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
> >> > at lrm.c:1234
> >> > 1234        if (lrm_state == NULL && is_remote_node) {
> >> > (gdb) n
> >> > 1240        CRM_ASSERT(lrm_state != NULL);
> >> > (gdb) n
> >> >
> >> > Program received signal SIGABRT, Aborted.
> >> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
> >> > (gdb)
> >> >
> >> >
> >> > I do not know why yet, so I will continue investigating.
> >> >
> >> >
> >>
> >> I read the code of g_hash_table_lookup().
> >> Keys are first compared by the hash value generated by crm_str_hash,
> >> and strcase_equal() is called only when those hash values match.
> >
> > good catch. I've updated the patch in this pull request. Can you give it a
> > go?
> >
> > https://github.com/ClusterLabs/pacemaker/pull/462
> >
> With this change alone, the fail-count is still not cleared.
> 
> $ crm_resource -C -r p1 -N X3650H
> Cleaning up p1 on X3650H
> Waiting for 1 replies from the CRMd. OK
> 
> $ grep fail-count /var/log/ha-log
> Mar 18 13:53:36 x3650g attrd[3610]:    debug: attrd_client_message:
> Broadcasting fail-count-p1[X3650H] = (null)
> $
> 
> $ crm_mon -rf1
> Last updated: Tue Mar 18 13:54:51 2014
> Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
> Stack: corosync
> Current DC: x3650h (3232261384) - partition with quorum
> Version: 1.1.10-83553fa
> 2 Nodes configured
> 1 Resources configured
> 
> 
> Online: [ x3650g x3650h ]
> 
> Full list of resources:
> 
>  p1     (ocf::pacemaker:Dummy): Stopped
> 
> Migration summary:
> * Node x3650h:
>    p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18
> 13:53:19 2014'
> * Node x3650g:
> $
> 
> 
> So this change also seems to be necessary.

yep, added your patch to the pull request
https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d

I found another one in stonith that I fixed.

https://github.com/ClusterLabs/pacemaker/pull/462

Are we good for merging this now?

-- Vossel