[Pacemaker] crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)

Wed Oct 28 16:57:58 EDT 2009

Someone else also reported this.
Could you possibly try this patch?
   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/e1696316f46e

On Tue, Oct 13, 2009 at 10:21 PM, Li, Ling (Ling)
<lli1 at alcatel-lucent.com> wrote:
> Hi Andrew,
>
> Thanks for the quick response.
> Yes, the attachment contains three core files: one from node17 (amm5000a) and two from node86 (esm1000havat1), taken between Oct 12 03:00:00 and 05:00:00. gdb's bt output is the same for all three core files (and for most of the core dumps I have collected so far):
> Program terminated with signal 11, Segmentation fault.
> [New process 26763]
> #0  0x0000000000428d7f in te_graph_trigger ()
> (gdb) bt
> #0  0x0000000000428d7f in te_graph_trigger ()
> #1  0x00002b259a2dcf63 in crm_trigger_dispatch (source=0x14964840, callback=0x428d34 <te_graph_trigger>,
>    userdata=0x14964840) at mainloop.c:53
> #2  0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #3  0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #4  0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #5  0x0000000000405020 in crmd_init ()
> #6  0x0000000000404efb in main ()
>
> One thing I found consistent is that whenever bt shows the above backtrace, the last API message before crmd is killed by signal 11 is always the following.
>
> Searching for SIGSEG in both ha-debug.reboot files, right before crmd is killed by SIGSEGV (signal 11), ha-debug shows:
>  Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: APIclients_input_dispatch() {
>  213422 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: ProcessAnAPIRequest() {
>  213423 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: Sending API message to cluster...
>  213424 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG: Dumping message with 16 fields
>  213425 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[0] : [__name__=create_request_adv]
>  213426 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[1] : [origin=do_election_count_vote]
>  213427 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[2] : [t=crmd]
>  213428 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[3] : [version=3.0.1]
>  213429 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[4] : [subt=request]
>  213430 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[5] : [reference=no-vote-crmd-1255318694-247]
>  213431 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[6] : [crm_task=no-vote]
>  213432 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[7] : [crm_sys_to=crmd]
>  213433 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[8] : [crm_sys_from=crmd]
>  213434 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[9] : [crm_host_to=amm5000a]
>  213435 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[10] : [election-owner=d8fa36f2-e8a2-4771-8054-cfa50cc2a100]
>  213436 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[11] : [election-id=3]
>  213437 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[12] : [dest=amm5000a]
>  213438 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[13] : [oseq=56]
>  213439 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[14] : [from_id=crmd]
>  213440 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: MSG[15] : [to_id=crmd]
>  213441 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: process_clustermsg: node [esm1000havat1]
>  213442 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug:         return TRUE;
>  213443 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: }/*ProcessAnAPIRequest*/;
>  213444 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: return 1;
>  213445 Oct 12 03:38:14 esm1000havat1 heartbeat: [6861]: debug: }/*APIclients_input_dispatch*/;
>
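The manual search described above (find the crash marker in ha-debug, then look at the lines just before it) could be scripted roughly as follows. This is only a sketch: `context_before` is a hypothetical helper, and the sample crash line and marker string are made up for illustration, since the exact wording of the crash message in ha-debug is not shown in this thread.

```python
from collections import deque


def context_before(lines, marker, n=5):
    """For each line containing `marker`, return the `n` lines that precede it."""
    recent = deque(maxlen=n)  # rolling window of the most recent lines
    hits = []
    for line in lines:
        if marker in line:
            hits.append(list(recent))
        recent.append(line)
    return hits


# Two lines taken from the log above plus one invented crash line (assumption).
sample = [
    "heartbeat: [6861]: debug: MSG[6] : [crm_task=no-vote]",
    "heartbeat: [6861]: debug: }/*APIclients_input_dispatch*/;",
    "heartbeat: [6861]: WARN: crmd killed by signal 11",  # hypothetical wording
]

for block in context_before(sample, "killed by signal 11", n=2):
    print("\n".join(block))
```

For a real log, the lines would come from iterating over the open ha-debug file, with the marker adjusted to whatever text the crash message actually contains.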
>
> I will run hb_report when the crmd reboot happens again.
>
> Thanks,
>
> Ling Li
>
>
>> -----Original Message-----
>> From: Andrew Beekhof [mailto:andrew at beekhof.net]
>> Sent: Tuesday, October 13, 2009 2:07 PM
>> To: Li, Ling (Ling)
>> Subject: Re: crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)
>>
>>
>> On Oct 13, 2009, at 6:23 PM, Li, Ling (Ling) wrote:
>>
>> > Hi Andrew,
>> >
>> > My attempt to send the attachment of size 7M to andre at beekhof.net
>> > failed too.
>>
>> You're missing a 'w' :-)
>>
>> > Is there any other way I can send you the attachment?
>>
>> Creating a bug is probably the easiest way.
>> But you're not actually attaching the core files, are you?  I can't do
>> anything with them.
>> Run hb_report instead.
>>
>> >
>> > Thanks,
>> >
>> > Ling Li
>> >
>> >> -----Original Message-----
>> >> From: Li, Ling (Ling)
>> >> Sent: Tuesday, October 13, 2009 12:13 PM
>> >> To: 'andrew at beekhof.net'
>> >> Cc: Li, Ling (Ling)
>> >> Subject: FW: crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)
>> >>
>> >> Hi Andrew,
>> >>
>> >> I sent the attachment to pacemaker at oss.clusterlabs.org yesterday, but I
>> >> did not see the email in oss.clusterlabs.org/pipermail/pacemaker/2009-October/.
>> >> I am not sure whether it was rejected because of the size of the attachment.
>> >> I have resent it to you via a different email address. I hope you get it this
>> >> time.
>> >>
>> >> Thanks,
>> >>
>> >> Ling Li
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Li, Ling (Ling)
>> >> Sent: Monday, October 12, 2009 5:14 PM
>> >> To: 'pacemaker at oss.clusterlabs.org'
>> >> Subject: Re: crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)
>> >>
>> >>
>> >> Hi Andrew,
>> >>
>> >> Thank you very much for the help.
>> >>
>> >> Both nodes in our cluster are running pacemaker1.0.5/heartbeat3/RHEL5.3.
>> >> I set debug to 3 once in order to get more debug messages.
>> >> The attachment contains the ha-debug logs and core files from each node.
>> >>
>> >> Ling Li
>> >>
>> >>> From: Andrew Beekhof <andrew at beekhof.net>
>> >>> To: pacemaker at oss.clusterlabs.org
>> >>> Cc: "pacemaker at oss.clusterlabs.org" <pacemaker at clusterlabs.org>
>> >>> Subject: Re: [Pacemaker] crmd killed by signal 11
>> >>>   (pacemaker1.0.5/heartbeat3/RHEL5.3)
>> >>> Message-ID:
>> >>>   <b80f82d20910111226r74e3148fg1f845286b2cc34ae at mail.gmail.com>
>> >>> Content-Type: text/plain; charset=ISO-8859-1
>> >>>
>> >>> On Fri, Oct 9, 2009 at 7:02 AM, Li, Ling (Ling) <lli1 at alcatel-lucent.com> wrote:
>> >>>> Hi,
>> >>>>
>> >>>> The RPMs were downloaded from
>> >>>> http://download.opensuse.org/repositories/server:/haclustering/RHEL_5/x86_64
>> >>>>
>> >>>> We have a cluster of two nodes with 17 resources configured.
>> >>>> Among those 17 resources are 4 clones, 1 group, and 12 primitives.
>> >>>> Cluster options are
>> >>>> - symmetric-cluster=true
>> >>>> - stonith-enabled=false
>> >>>> The rest are default.
>> >>>>
>> >>>>
>> >>>> xmllint -relaxng /usr/share/pacemaker/pacemaker.rng cib.xml
>> >>>> returns
>> >>>> cib.xml validates
>> >>>>
>> >>>> The cluster works fine except that crmd is killed by signal 11 sporadically.
>> >>>> So far I have seen the following four causes; the first one is the most common.
>> >>>>
>> >>>> 1. Core was generated by `/usr/lib64/heartbeat/crmd'.
>> >>>> Program terminated with signal 11, Segmentation fault.
>> >>>> [New process 2543]
>> >>>> #0  0x0000000000428d7f in te_graph_trigger ()
>> >>>
>> >>> I'd have expected gdb to indicate a line number here.
>> >>> Hard to know what the problem might be... do you have the logs for this
>> >>> crash?
>> >>>
>> >>>> (gdb) bt
>> >>>> #0  0x0000000000428d7f in te_graph_trigger ()
>> >>>> #1  0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
>> >>>>     userdata=0xda327a0) at mainloop.c:53
>> >>>> #2  0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >>>> #3  0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
>> >>>> #4  0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
>> >>>> #5  0x0000000000405020 in crmd_init ()
>> >>>> #6  0x0000000000404efb in main ()
>> >>>> ---------
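In the backtrace above, the frames inside crmd itself (te_graph_trigger, crmd_init, main) carry no `at file:line` location, while crm_trigger_dispatch does; that is usually a sign that gdb could not find debugging symbols for the crmd binary. A small sketch for listing such frames in a pasted backtrace follows. The helper name `frames_without_source` and the regular expressions are illustrative assumptions, not part of gdb or Pacemaker.

```python
import re

# Start of a gdb frame line: "#N  [0xADDR in ]function ("
FRAME_START = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+ in )?(\S+) \(")
# A resolved source location such as "at mainloop.c:53"
SOURCE_LOC = re.compile(r"\bat \S+:\d+")
# A frame that gdb attributes to a shared library: "... from /lib64/libfoo.so"
FROM_LIB = re.compile(r" from \S+$")


def frames_without_source(bt_text):
    """Return (frame number, function) for non-library frames lacking file:line."""
    frames = []  # each entry: [number, function, full frame text]
    for line in bt_text.splitlines():
        m = FRAME_START.match(line)
        if m:
            frames.append([int(m.group(1)), m.group(2), line.rstrip()])
        elif frames:
            # Continuation of a frame whose arguments wrapped onto the next line.
            frames[-1][2] += " " + line.strip()
    return [
        (num, func)
        for num, func, text in frames
        if not SOURCE_LOC.search(text) and not FROM_LIB.search(text)
    ]


# Frames copied from the backtrace in this thread (frames #4 and #6 omitted).
SAMPLE_BT = """\
#0  0x0000000000428d7f in te_graph_trigger ()
#1  0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
    userdata=0xda327a0) at mainloop.c:53
#2  0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#3  0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
#5  0x0000000000405020 in crmd_init ()
"""

print(frames_without_source(SAMPLE_BT))
```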
>> >>>> 2. Program terminated with signal 11, Segmentation fault.
>> >>>
>> >>> This one is (now) fixed in
>> >>>   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/2a324fe868c1
>> >>>
>> >>> Any reason you're running with such a high debug level?
>> >>>
>> >>>> [New process 567]
>> >>>> #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
>> >>>> (gdb) bt
>> >>>> #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
>> >>>> #1  0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
>> >>>>     b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
>> >>>> #2  0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
>> >>>>     prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0xa96b170, formatted=1)
>> >>>>     at xml.c:1175
>> >>>> #3  0x00002b71d9ac4a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
>> >>>>     msg=0xa96b170, text=0x4344a1 "Current state of the LRM") at xml.c:775
>> >>>> #4  0x000000000041b773 in do_lrm_query ()
>> >>>> #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
>> >>>> #6  0x0000000000405ad8 in do_fsa_action ()
>> >>>> #7  0x0000000000406620 in s_crmd_fsa_actions ()
>> >>>> #8  0x0000000000405f9c in s_crmd_fsa ()
>> >>>> #9  0x0000000000411b35 in crm_fsa_trigger ()
>> >>>> #10 0x00002b71d9ad4f63 in crm_trigger_dispatch (source=0xa95dd70, callback=0x411acf <crm_fsa_trigger>,
>> >>>>     userdata=0xa95dd70) at mainloop.c:53
>> >>>> #11 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >>>> #12 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
>> >>>> #13 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
>> >>>> #14 0x0000000000405020 in crmd_init ()
>> >>>> #15 0x0000000000404efb in main ()
>> >>>> -------
>> >>>> 3. Core was generated by `/usr/lib64/heartbeat/crmd'.
>> >>>
>> >>> This isn't a "crash"; Pacemaker has encountered a situation it didn't
>> >>> expect.
>> >>> When this happens, it saves the program state (by generating a core
>> >>> file) and exits so that it can be respawned and try again.
>> >>>
>> >>>> Program terminated with signal 6, Aborted.
>> >>>> [New process 4101]
>> >>>> #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
>> >>>> (gdb) bt
>> >>>> #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
>> >>>> #1  0x0000003839e31cc0 in abort () from /lib64/libc.so.6
>> >>>> #2  0x00002ba6d6c9238d in crm_abort (file=0x430f89 "election.c",
>> >>>>     function=0x430ff0 "do_election_count_vote", line=265,
>> >>>>     assert_condition=0x431138 "crm_str_eq(fsa_our_uuid, election_owner, TRUE)", do_core=1, do_fork=0)
>> >>>>     at utils.c:1375
>> >>>> #3  0x0000000000412712 in do_election_count_vote ()
>> >>>> #4  0x0000000000405ad8 in do_fsa_action ()
>> >>>> #5  0x00000000004066d4 in s_crmd_fsa_actions ()
>> >>>> #6  0x0000000000405f9c in s_crmd_fsa ()
>> >>>> #7  0x0000000000411b35 in crm_fsa_trigger ()
>> >>>> #8  0x00002ba6d6ca7f63 in crm_trigger_dispatch (source=0x143cbd70, callback=0x411acf <crm_fsa_trigger>,
>> >>>>     userdata=0x143cbd70) at mainloop.c:53
>> >>>> #9  0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >>>> #10 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
>> >>>> #11 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
>> >>>> #12 0x0000000000405020 in crmd_init ()
>> >>>> #13 0x0000000000404efb in main ()
>> >>>> ---
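The behaviour Andrew describes for case 3 (a failed internal check calls abort(), the process dies with signal 6 and may leave a core file, and whatever supervises it can start it again) can be sketched as follows. This illustrates the general pattern only and is not Pacemaker's crm_abort; Pacemaker is written in C, Python is used here just to show the control flow, and `run_once` is a hypothetical name. It needs a POSIX system because it relies on fork.

```python
import os
import signal


def run_once(check):
    """Fork a child that aborts when `check()` is false.

    Returns the number of the signal that terminated the child,
    or None if the child exited normally.
    """
    pid = os.fork()
    if pid == 0:
        # Child: stands in for the daemon hitting an unexpected state.
        if not check():
            # SIGABRT; whether a core file is written depends on system
            # settings such as `ulimit -c`.
            os.abort()
        os._exit(0)
    # Parent: stands in for the supervisor that would respawn the daemon.
    _, status = os.waitpid(pid, 0)
    return os.WTERMSIG(status) if os.WIFSIGNALED(status) else None


sig = run_once(lambda: False)
if sig == signal.SIGABRT:
    print("child aborted with signal", sig, "and could now be respawned")
```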
>> >>>> 4. Core was generated by `/usr/lib64/heartbeat/crmd'
>> >>>
>> >>> This is the same as 2. which is now fixed.
>> >>>
>> >>>> Program terminated with signal 11, Segmentation fault.
>> >>>> [New process 6817]
>> >>>> #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
>> >>>> (gdb) bt
>> >>>> #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
>> >>>> #1  0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
>> >>>>     b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
>> >>>> #2  0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
>> >>>>     prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0x6b95f00, formatted=1)
>> >>>>     at xml.c:1175
>> >>>> #3  0x00002b57db361a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
>> >>>>     msg=0x6b95f00, text=0x4344a1 "Current state of the LRM") at xml.c:775
>> >>>> #4  0x000000000041b773 in do_lrm_query ()
>> >>>> #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
>> >>>> #6  0x0000000000405ad8 in do_fsa_action ()
>> >>>> #7  0x0000000000406620 in s_crmd_fsa_actions ()
>> >>>> #8  0x0000000000405f9c in s_crmd_fsa ()
>> >>>> #9  0x0000000000411b35 in crm_fsa_trigger ()
>> >>>> #10 0x00002b57db371f63 in crm_trigger_dispatch (source=0x6b91d70, callback=0x411acf <crm_fsa_trigger>,
>> >>>>     userdata=0x6b91d70) at mainloop.c:53
>> >>>> #11 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>> >>>> #12 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
>> >>>> #13 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
>> >>>> #14 0x0000000000405020 in crmd_init ()
>> >>>> #15 0x0000000000404efb in main ()
>> >>>> ----
>> >>>>
>> >>>>
>> >>>> I ran crm_verify -VVVVV -L
>> >>>> The output has
>> >>>> 1. for each line of cib.xml, a message: "debug: debug2: log_data_element: get_xpath_object: Bad Input".
>> >>>> 2. many find_xml_node messages such as
>> >>>>    find_xml_node: Could not find operations in clone.
>> >>>>    find_xml_node: Could not find group in clone.
>> >>>>    find_xml_node: Could not find operations in primitive. (but each primitive has a monitor operation)
>> >>>>
>> >>>> 3. Warnings found during check: config may not be valid
>> >>>>
>> >>>> My questions:
>> >>>>
>> >>>> 1. Can I ignore 1 and 2 since cib.xml passed the xmllint validation?
>> >>>> 2. Which tool can I use to make sure the cib.xml is absolutely correct?
>> >>>>
>> >>>> Thanks,
>> >>>>
>> >>>> Ling Li
>> >>>>
>> >>>>
>> >>>> _______________________________________________
>> >>>> Pacemaker mailing list
>> >>>> Pacemaker at oss.clusterlabs.org
>> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>> End of Pacemaker Digest, Vol 23, Issue 25
>> >>> *****************************************
>>
>> -- Andrew
>>
>>
>
>