[Pacemaker] crmd was aborted at pacemaker 1.1.11

Mon Mar 17 03:37:28 EDT 2014

2014-03-15 4:08 GMT+09:00 David Vossel <dvossel at redhat.com>:
>
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> To: "pm" <pacemaker at oss.clusterlabs.org>
>> Sent: Friday, March 14, 2014 5:52:38 AM
>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>>
>> Hi,
>>
>> When I specify the node name in upper case and run crm_resource,
>> crmd aborts.
>> (The real node name is in lower case.)
>
> https://github.com/ClusterLabs/pacemaker/pull/462
>
> does that fix it?
>

No, that does not fix it. glib seems to behave strangely here.
I tested this branch:
https://github.com/davidvossel/pacemaker/tree/lrm-segfault
* Red Hat Enterprise Linux Server release 6.4 (Santiago)
* glib2-2.22.5-7.el6.x86_64

strcase_equal() is not called from g_hash_table_lookup().

[x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
...snip...
(gdb) b lrm.c:1232
Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
(gdb) b strcase_equal
Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
(gdb) c
Continuing.

Breakpoint 1, do_lrm_invoke (action=288230376151711744,
cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
msg_data=0x7fff8d679540) at lrm.c:1232
1232        lrm_state = lrm_state_find(target_node);
(gdb) s
lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
267     {
(gdb) n
268         if (!node_name) {
(gdb) n
271         return g_hash_table_lookup(lrm_state_table, node_name);
(gdb) p g_hash_table_size(lrm_state_table)
$1 = 1
(gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
$2 = 0x1c791a0 "x3650h"
(gdb) p node_name
$3 = 0x1d4c650 "X3650H"
(gdb) n
272     }
(gdb) n
do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
at lrm.c:1234
1234        if (lrm_state == NULL && is_remote_node) {
(gdb) n
1240        CRM_ASSERT(lrm_state != NULL);
(gdb) n

Program received signal SIGABRT, Aborted.
0x0000003787e328a5 in raise () from /lib64/libc.so.6
(gdb)

I do not know why yet, so I will continue investigating.
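
One possible explanation, stated as an assumption and not as a finding: a
hash table normally compares the stored hash value before it calls the
equality callback, so if lrm_state_table is still hashed case-sensitively,
"X3650H" and "x3650h" would get different hash values and strcase_equal()
would never be reached. The sketch below is a toy one-entry table, not
Pacemaker or GLib code; every name in it (toy_lookup, hash_case_sensitive,
hash_case_insensitive, equal_ignore_case, equal_calls) is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-ins for illustration; not Pacemaker or GLib code. */

static int equal_calls = 0;   /* counts how often the equality callback runs */

/* Simple case-sensitive string hash: "x3650h" and "X3650H" hash differently. */
static unsigned int hash_case_sensitive(const char *s)
{
    unsigned int h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33u + (unsigned char)*s;
    }
    return h;
}

/* Same hash, but each byte is folded to lower case first. */
static unsigned int hash_case_insensitive(const char *s)
{
    unsigned int h = 5381;
    for (; *s != '\0'; s++) {
        h = h * 33u + (unsigned char)tolower((unsigned char)*s);
    }
    return h;
}

/* Case-insensitive equality callback, playing the role of strcase_equal(). */
static int equal_ignore_case(const char *a, const char *b)
{
    equal_calls++;
    return strcasecmp(a, b) == 0;
}

typedef unsigned int (*hash_fn)(const char *);

/* A one-entry "table": the key plus the hash computed when it was stored. */
struct entry {
    const char *key;
    unsigned int key_hash;
};

/* Compare hashes first; only call the equality callback on a hash match. */
static const char *toy_lookup(const struct entry *e, hash_fn hash,
                              const char *wanted)
{
    assert(e != NULL && wanted != NULL);
    if (e->key_hash != hash(wanted)) {
        return NULL;          /* equality callback is never reached */
    }
    return equal_ignore_case(e->key, wanted) ? e->key : NULL;
}
```

With the case-sensitive hash, looking up "X3650H" in a table that holds
"x3650h" returns NULL without ever calling the equality callback. With the
folding hash, the callback runs once and the key is found. If this is what
happens in crmd, the hash function would need to be case-insensitive as well.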

>> # crm_resource -C -r p1 -N X3650H
>> Cleaning up p1 on X3650H
>> Waiting for 1 replies from the CRMdNo messages received in 60 seconds..
>> aborting
>>
>> Mar 14 18:33:10 x3650h crmd[10718]:    error: crm_abort:
>> do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state !=
>> NULL
>> ...snip...
>> Mar 14 18:33:10 x3650h pacemakerd[10708]:    error: child_waitpid:
>> Managed process 10718 (crmd) dumped core
>>
>>
>> * The state before performing crm_resource.
>> ----
>> Stack: corosync
>> Current DC: x3650g (3232261383) - partition with quorum
>> Version: 1.1.10-38c5972
>> 2 Nodes configured
>> 3 Resources configured
>>
>>
>> Online: [ x3650g x3650h ]
>>
>> Full list of resources:
>>
>> f-g     (stonith:external/ibmrsa-telnet):       Started x3650h
>> f-h     (stonith:external/ibmrsa-telnet):       Started x3650g
>> p1      (ocf::pacemaker:Dummy): Stopped
>>
>> Migration summary:
>> * Node x3650g:
>> * Node x3650h:
>>    p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14
>> 18:32:48 2014'
>>
>> Failed actions:
>>     p1_monitor_10000 on x3650h 'not running' (7): call=16,
>> status=complete, last-rc-change='Fri Mar 14 18:32:48 2014',
>> queued=0ms, exec=0ms
>> ----
>>
>> Just for reference, a similar problem did not occur with crm_standby.
>> $ crm_standby -U X3650H -v on
>>
>>
>> Best Regards,
>> Kazunori INOUE
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>