[Pacemaker] crmd was aborted at pacemaker 1.1.11

Tue Mar 18 01:30:01 EDT 2014

2014-03-18 8:03 GMT+09:00 David Vossel <dvossel at redhat.com>:
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Monday, March 17, 2014 4:51:11 AM
>> Subject: Re: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>>
>> 2014-03-17 16:37 GMT+09:00 Kazunori INOUE <kazunori.inoue3 at gmail.com>:
>> > 2014-03-15 4:08 GMT+09:00 David Vossel <dvossel at redhat.com>:
>> >>
>> >>
>> >> ----- Original Message -----
>> >>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> >>> To: "pm" <pacemaker at oss.clusterlabs.org>
>> >>> Sent: Friday, March 14, 2014 5:52:38 AM
>> >>> Subject: [Pacemaker] crmd was aborted at pacemaker 1.1.11
>> >>>
>> >>> Hi,
>> >>>
>> >>> When I run crm_resource with the node name given in UPPER case,
>> >>> crmd aborts.
>> >>> (The real node name is in LOWER case.)
>> >>
>> >> https://github.com/ClusterLabs/pacemaker/pull/462
>> >>
>> >> does that fix it?
>> >>
>> >
>> > No, it does not. glib does not behave the way I expected.
>> > I tested this branch.
>> > https://github.com/davidvossel/pacemaker/tree/lrm-segfault
>> > * Red Hat Enterprise Linux Server release 6.4 (Santiago)
>> > * glib2-2.22.5-7.el6.x86_64
>> >
>> > strcase_equal() is not called from g_hash_table_lookup().
>> >
>> > [x3650h ~]$ gdb /usr/libexec/pacemaker/crmd 17409
>> > ...snip...
>> > (gdb) b lrm.c:1232
>> > Breakpoint 1 at 0x4251d0: file lrm.c, line 1232.
>> > (gdb) b strcase_equal
>> > Breakpoint 2 at 0x429828: file lrm_state.c, line 95.
>> > (gdb) c
>> > Continuing.
>> >
>> > Breakpoint 1, do_lrm_invoke (action=288230376151711744,
>> > cause=C_IPC_MESSAGE, cur_state=S_NOT_DC, current_input=I_ROUTER,
>> > msg_data=0x7fff8d679540) at lrm.c:1232
>> > 1232        lrm_state = lrm_state_find(target_node);
>> > (gdb) s
>> > lrm_state_find (node_name=0x1d4c650 "X3650H") at lrm_state.c:267
>> > 267     {
>> > (gdb) n
>> > 268         if (!node_name) {
>> > (gdb) n
>> > 271         return g_hash_table_lookup(lrm_state_table, node_name);
>> > (gdb) p g_hash_table_size(lrm_state_table)
>> > $1 = 1
>> > (gdb) p (char*)((GList*)g_hash_table_get_keys(lrm_state_table))->data
>> > $2 = 0x1c791a0 "x3650h"
>> > (gdb) p node_name
>> > $3 = 0x1d4c650 "X3650H"
>> > (gdb) n
>> > 272     }
>> > (gdb) n
>> > do_lrm_invoke (action=288230376151711744, cause=C_IPC_MESSAGE,
>> > cur_state=S_NOT_DC, current_input=I_ROUTER, msg_data=0x7fff8d679540)
>> > at lrm.c:1234
>> > 1234        if (lrm_state == NULL && is_remote_node) {
>> > (gdb) n
>> > 1240        CRM_ASSERT(lrm_state != NULL);
>> > (gdb) n
>> >
>> > Program received signal SIGABRT, Aborted.
>> > 0x0000003787e328a5 in raise () from /lib64/libc.so.6
>> > (gdb)
>> >
>> >
>> > I wonder why... I will continue investigating.
>> >
>> >
>>
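>> I read the code of g_hash_table_lookup().
>> Keys are first compared by the hash value generated by crm_str_hash, and
>> strcase_equal() is called only when the hash values match. Because
>> crm_str_hash is case-sensitive, "X3650H" and "x3650h" never get that far.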
>
> Good catch. I've updated the patch in this pull request. Can you give it a go?
>
> https://github.com/ClusterLabs/pacemaker/pull/462
>
With that change alone, fail-count is still not cleared.

$ crm_resource -C -r p1 -N X3650H
Cleaning up p1 on X3650H
Waiting for 1 replies from the CRMd. OK

$ grep fail-count /var/log/ha-log
Mar 18 13:53:36 x3650g attrd[3610]:    debug: attrd_client_message:
Broadcasting fail-count-p1[X3650H] = (null)
$

$ crm_mon -rf1
Last updated: Tue Mar 18 13:54:51 2014
Last change: Tue Mar 18 13:53:36 2014 by hacluster via crmd on x3650h
Stack: corosync
Current DC: x3650h (3232261384) - partition with quorum
Version: 1.1.10-83553fa
2 Nodes configured
1 Resources configured


Online: [ x3650g x3650h ]

Full list of resources:

 p1     (ocf::pacemaker:Dummy): Stopped

Migration summary:
* Node x3650h:
   p1: migration-threshold=1 fail-count=1 last-failure='Tue Mar 18
13:53:19 2014'
* Node x3650g:
$


So this change also seems to be necessary.

$ git diff --patch-with-stat attrd/commands.c
 attrd/commands.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/attrd/commands.c b/attrd/commands.c
index 985f90c..9f46d92 100644
--- a/attrd/commands.c
+++ b/attrd/commands.c
@@ -145,7 +145,7 @@ create_attribute(xmlNode *xml)
     a->id      = crm_element_value_copy(xml, F_ATTRD_ATTRIBUTE);
     a->set     = crm_element_value_copy(xml, F_ATTRD_SET);
     a->uuid    = crm_element_value_copy(xml, F_ATTRD_KEY);
-    a->values = g_hash_table_new_full(crm_str_hash, g_str_equal, NULL, free_attribute_value);
+    a->values = g_hash_table_new_full(crm_strcase_hash, crm_strcase_equal, NULL, free_attribute_value);

 #if ENABLE_ACL
     crm_trace("Performing all %s operations as user '%s'", a->id, a->user);
$

The result is as follows.

$ grep fail-count /var/log/ha-log
Mar 18 13:57:31 x3650g attrd[6688]:    debug: attrd_client_message:
Broadcasting fail-count-p1[X3650H] = (null) (writer)
Mar 18 13:57:31 x3650g attrd[6688]:     info: attrd_peer_update:
Setting fail-count-p1[X3650H]: 1 -> (null) from x3650g
Mar 18 13:57:31 x3650g attrd[6688]:    debug: write_attribute: Update:
x3650h[fail-count-p1]=(null) (3232261384 3232261384 3232261384 x3650h)
Mar 18 13:57:31 x3650g attrd[6688]:   notice: write_attribute: Sent
update 14 with 1 changes for fail-count-p1, id=<n/a>, set=(null)
Mar 18 13:57:31 x3650h attrd[20902]:     info: attrd_peer_update:
Setting fail-count-p1[X3650H]: 1 -> (null) from x3650g
Mar 18 13:57:31 x3650g cib[6685]:     info: cib_perform_op: --
/cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1']
Mar 18 13:57:31 x3650g attrd[6688]:     info: attrd_cib_callback:
Update 14 for fail-count-p1: OK (0)
Mar 18 13:57:31 x3650g attrd[6688]:   notice: attrd_cib_callback:
Update 14 for fail-count-p1[x3650h]=(null): OK (0)
Mar 18 13:57:31 x3650h cib[20899]:     info: cib_perform_op: --
/cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1']
Mar 18 13:57:31 x3650h crmd[20904]:     info: abort_transition_graph:
Transition aborted by deletion of
nvpair[@id='status-3232261384-fail-count-p1']: Transient attribute
change (cib=0.3.18, source=te_update_diff:388,
path=/cib/status/node_state[@id='3232261384']/transient_attributes[@id='3232261384']/instance_attributes[@id='status-3232261384']/nvpair[@id='status-3232261384-fail-count-p1'],
1)


>
>>
>> *** This is a quick-fix solution. ***
>>
>>  crmd/lrm_state.c   |    4 ++--
>>  include/crm/crm.h  |    2 ++
>>  lib/common/utils.c |   11 +++++++++++
>>  3 files changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/crmd/lrm_state.c b/crmd/lrm_state.c
>> index d20d74a..ae036fd 100644
>> --- a/crmd/lrm_state.c
>> +++ b/crmd/lrm_state.c
>> @@ -234,13 +234,13 @@ lrm_state_init_local(void)
>>      }
>>
>>      lrm_state_table =
>> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, internal_lrm_state_destroy);
>> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, internal_lrm_state_destroy);
>>      if (!lrm_state_table) {
>>          return FALSE;
>>      }
>>
>>      proxy_table =
>> -        g_hash_table_new_full(crm_str_hash, strcase_equal, NULL, remote_proxy_free);
>> +        g_hash_table_new_full(crm_str_hash2, strcase_equal, NULL, remote_proxy_free);
>>      if (!proxy_table) {
>>           g_hash_table_destroy(lrm_state_table);
>>          return FALSE;
>> diff --git a/include/crm/crm.h b/include/crm/crm.h
>> index b763cc0..46fe5df 100644
>> --- a/include/crm/crm.h
>> +++ b/include/crm/crm.h
>> @@ -195,7 +195,9 @@ typedef GList *GListPtr;
>>  #  include <crm/error.h>
>>
>>  #  define crm_str_hash g_str_hash_traditional
>> +#  define crm_str_hash2 g_str_hash_traditional2
>>
>>  guint g_str_hash_traditional(gconstpointer v);
>> +guint g_str_hash_traditional2(gconstpointer v);
>>
>>  #endif
>> diff --git a/lib/common/utils.c b/lib/common/utils.c
>> index 29d7965..50fa6c0 100644
>> --- a/lib/common/utils.c
>> +++ b/lib/common/utils.c
>> @@ -2368,6 +2368,17 @@ g_str_hash_traditional(gconstpointer v)
>>
>>      return h;
>>  }
>> +guint
>> +g_str_hash_traditional2(gconstpointer v)
>> +{
>> +    const signed char *p;
>> +    guint32 h = 0;
>> +
>> +    for (p = v; *p != '\0'; p++)
>> +        h = (h << 5) - h + g_ascii_tolower(*p);
>> +
>> +    return h;
>> +}
>>
>>  void *
>>  find_library_function(void **handle, const char *lib, const char *fn, gboolean fatal)
>>
>>
>> >>> # crm_resource -C -r p1 -N X3650H
>> >>> Cleaning up p1 on X3650H
>> >>> Waiting for 1 replies from the CRMdNo messages received in 60 seconds..
>> >>> aborting
>> >>>
>> >>> Mar 14 18:33:10 x3650h crmd[10718]:    error: crm_abort:
>> >>> do_lrm_invoke: Triggered fatal assert at lrm.c:1240 : lrm_state !=
>> >>> NULL
>> >>> ...snip...
>> >>> Mar 14 18:33:10 x3650h pacemakerd[10708]:    error: child_waitpid:
>> >>> Managed process 10718 (crmd) dumped core
>> >>>
>> >>>
>> >>> * The state before performing crm_resource.
>> >>> ----
>> >>> Stack: corosync
>> >>> Current DC: x3650g (3232261383) - partition with quorum
>> >>> Version: 1.1.10-38c5972
>> >>> 2 Nodes configured
>> >>> 3 Resources configured
>> >>>
>> >>>
>> >>> Online: [ x3650g x3650h ]
>> >>>
>> >>> Full list of resources:
>> >>>
>> >>> f-g     (stonith:external/ibmrsa-telnet):       Started x3650h
>> >>> f-h     (stonith:external/ibmrsa-telnet):       Started x3650g
>> >>> p1      (ocf::pacemaker:Dummy): Stopped
>> >>>
>> >>> Migration summary:
>> >>> * Node x3650g:
>> >>> * Node x3650h:
>> >>>    p1: migration-threshold=1 fail-count=1 last-failure='Fri Mar 14
>> >>> 18:32:48 2014'
>> >>>
>> >>> Failed actions:
>> >>>     p1_monitor_10000 on x3650h 'not running' (7): call=16,
>> >>> status=complete, last-rc-change='Fri Mar 14 18:32:48 2014',
>> >>> queued=0ms, exec=0ms
>> >>> ----
>> >>>
>> >>> Just for reference, a similar problem did not occur with crm_standby.
>> >>> $ crm_standby -U X3650H -v on
>> >>>
>> >>>
>> >>> Best Regards,
>> >>> Kazunori INOUE
>> >>>
>> >>> _______________________________________________
>> >>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> >>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>>
>> >>> Project Home: http://www.clusterlabs.org
>> >>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> >>> Bugs: http://bugs.clusterlabs.org
>> >>>