[Pacemaker] crmd was aborted at pacemaker 1.1.11

Wed Mar 19 05:20:29 EDT 2014

2014-03-19 0:55 GMT+09:00 David Vossel <dvossel at redhat.com>:
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Tuesday, March 18, 2014 12:30:01 AM
>> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>>
>> 2014-03-18 8:03 GMT+09:00 David Vossel <dvossel at redhat.com>:
>> >
>> > ----- Original Message -----
>> >> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> >> To: "The Pacemaker cluster resource manager"
>> >> <pacemaker at oss.clusterlabs.org>
>> >> Sent: Monday, March 17, 2014 4:51:11 AM
>> >> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >>
>> >> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.inoue3 at gmail.com>:
>> >> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvossel at redhat.com>:
>> >> >>
>> >> >>
>> >> >> ----- Original Message -----
>> >> >>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> >> >>> To: "pm" <pacemaker at oss.clusterlabs.org>
>> >> >>> Sent: Friday, March 14, 2014 5:52:38 AM
>> >> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >> >>>
>> >> >>> Hi,
>> >> >>>
>> >> >>> When I ran crm_resource with the node name given in UPPER case,
>> >> >>> crmd aborted.
>> >> >>> (The actual node name is lower case.)
>> >> >>
>> >> >> https://github.com/ClusterLabs/pacemaker/pull/462
>> >> >>
>> >> >> does that fix it?
>> >> >>
>> >> >
>> >> > The result is NO; glib seems to behave in an unexpected way.
>> >> > I tested this branch:
>> >> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
>> >> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
>> >> > * glib2-2.22.5-7.el6.x86_64
>> >> >
>> >> > strcase_equal() is not called from g_hash_table_lookup().
>> >> >
>> >> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
>> >> > ...snip...
>> >> > (gdb) b lrm.c:1232
>> >> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
>> >> > (gdb) b strcase_equal
>> >> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
>> >> > (gdb) c
>> >> > Continuing.
>> >> >
>> >> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
>> >> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
>> >> > msg_data=0x7fff8d679540) at lrm.c:1232
>> >> > 1232        lrm_state = lrm_state_find(target_node);
>> >> > (gdb) s
>> >> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
>> >> > 267     {
>> >> > (gdb) n
>> >> > 268         if (!node_name) {
>> >> > (gdb) n
>> >> > 271         return g_hash_table_lookup(lrm_state_table, node_name);
>> >> > (gdb) p g_hash_table_size(lrm_state_table)
>> >> > $1 = 1
>> >> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
>> >> > $2 = 0x1c791a0 "x3650h"
>> >> > (gdb) p node_name
>> >> > $3 = 0x1d4c650 "X3650H"
>> >> > (gdb) n
>> >> > 272     }
>> >> > (gdb) n
>> >> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
>> >> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
>> >> > at lrm.c:1234
>> >> > 1234        if (lrm_state == NULL && is_remote_node) {
>> >> > (gdb) n
>> >> > 1240        CRM_ASSERT(lrm_state != NULL);
>> >> > (gdb) n
>> >> >
>> >> > Program received signal SIGABRT, Aborted.
>> >> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
>> >> > (gdb)
>> >> >
>> >> >
>> >> > I don't know why yet, so I will continue investigating.
>> >> >
>> >> >
>> >>
>> >> I read the code of g_hash_table_lookup().
>> >> Keys are first compared by the hash value generated by crm_str_hash,
>> >> and strcase_equal() is called only when the hash values match.
>> >
>> > Good catch. I've updated the patch in this pull request. Can you give it a
>> > go?
>> >
>> > https://github.com/ClusterLabs/pacemaker/pull/462
>> >
>> With only this change, fail-count is still not cleared.
>>
>> $ crm_resource -C -r p1 -N X3650H
>> Cleaning up p1 on X3650H
>> Waiting for 1 replies from the CRMd. OK
>>
>> $ grep fail-count /var/log/ha-log
>> Mar 18 13:53:36 x3650g attrd[3610]:    debug: attrd_client_message: Broadcasting fail-count-p1[X3650H] = (null)
>> $
>>
>> $ crm_mon -rf1
>> Last updated: Tue Mar 18 13:54:51 2014
>> Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
>> Stack: corosync
>> Current DC: x3650h (3232261384) - partition with quorum
>> Version: 1.1.10-83553fa
>> 2 Nodes configured
>> 1 Resources configured
>>
>>
>> Online: [ x3650g x3650h ]
>>
>> Full list of resources:
>>
>>  p1     (ocf::pacemaker:Dummy): Stopped
>>
>> Migration summary:
>> * Node x3650h:
>>    p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18 13:53:19 2014'
>> * Node x3650g:
>> $
>>
>>
>> So this change also seems to be necessary.
>
> yep, added your patch to the pull request
> https://github.com/davidvossel/pacemaker/commit/c118ac5b5244890c19e4c7b2f5a39208d362b61d
>
> I found another one in stonith that I fixed.
>
> https://github.com/ClusterLabs/pacemaker/pull/462
>
> Are we good for merging this now?
>
> -- Vossel
>

I think you can merge it, since no defects are known at the moment.

P.S. I will now start testing the major commands that accept a node
name. This will take a week or more.
 * crm_standby
 * crm_resource
 * crm_failcount
 * and everything else.

If a defect is discovered, I will report it.

Thanks.
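
A note on the lookup problem discussed in the quoted thread above:
g_hash_table_lookup() compares hash values before it calls the equality
function, so a case-insensitive equality function only helps if the hash
function ignores case as well. The following is a minimal standalone
sketch of that behaviour in plain C; it is not the Pacemaker or glib
code. The names str_hash, strcase_hash, table_has and
check_node_name_lookup are hypothetical, the djb2-style hash is only a
stand-in for crm_str_hash, and the one-entry "table" merely imitates the
hash-then-equal order of a real hash table lookup.

```c
#include <assert.h>
#include <ctype.h>

/* Case-sensitive string hash (djb2-style). Assumption: this only stands
 * in for the hash function the table was created with. */
static unsigned int str_hash(const char *s)
{
    unsigned int h = 5381;

    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned char)*s;
    }
    return h;
}

/* Case-folding variant: hashes the lower-cased bytes, so names that
 * differ only in ASCII case always get the same hash value. */
static unsigned int strcase_hash(const char *s)
{
    unsigned int h = 5381;

    for (; *s != '\0'; s++) {
        h = h * 33 + (unsigned int)tolower((unsigned char)*s);
    }
    return h;
}

/* Case-insensitive equality; returns 1 if equal, 0 otherwise. */
static int strcase_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    return *a == *b; /* equal only if both strings ended together */
}

typedef unsigned int (*hash_fn)(const char *);

/* Imitates a lookup in a table holding the single key `stored`:
 * the hash values are compared first, and the equality function is
 * consulted only when they match. */
static int table_has(const char *stored, const char *wanted, hash_fn hash)
{
    if (hash(stored) != hash(wanted)) {
        return 0; /* strcase_equal() is never reached */
    }
    return strcase_equal(stored, wanted);
}

/* Self-check using the node names from the gdb session in the thread. */
static void check_node_name_lookup(void)
{
    /* Case-sensitive hash: the upper-case name is not found. */
    assert(!table_has("x3650h", "X3650H", str_hash));
    /* Case-folding hash: the same lookup succeeds. */
    assert(table_has("x3650h", "X3650H", strcase_hash));
}
```

With the case-sensitive hash, the lookup of "X3650H" fails before
strcase_equal() is ever called, which matches what the gdb session
showed. Pairing the equality function with a hash that folds case in the
same way makes the lookup succeed.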
