[Pacemaker] crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)

Mon Oct 12 17:13:57 EDT 2009

Hi Andrew,

Thank you very much for the help.

Both nodes in our cluster are running pacemaker1.0.5/heartbeat3/RHEL5.3.
I set debug to 3 once in order to get more debug messages.
The attached archive contains the ha-debug logs and cores from each node.
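
Since the first trace quoted below came out without line numbers, here is a minimal sketch of how one of the attached cores could be re-read non-interactively. The core file name is a placeholder, the function name core_backtrace is only illustrative, and line numbers will appear only if debug symbols matching the installed build are available, which is an assumption on my side.

```shell
#!/bin/sh
# Sketch only: print a full backtrace from a core file without an
# interactive gdb session. The default binary path is the one reported
# in the core headers; override CRMD_BIN if yours differs.
CRMD_BIN=${CRMD_BIN:-/usr/lib64/heartbeat/crmd}

# core_backtrace is a hypothetical helper name, not part of any package.
core_backtrace() {
    core_file=$1
    if [ ! -r "$core_file" ]; then
        echo "core file not readable: $core_file" >&2
        return 1
    fi
    # -batch exits after running the commands; -ex runs one gdb command.
    gdb -batch -ex "bt full" "$CRMD_BIN" "$core_file"
}

# Example (placeholder path): core_backtrace /path/to/core.2543
```

Running it on the node that produced the core keeps the shared libraries consistent with the addresses in the trace.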

Ling Li

> From: Andrew Beekhof <andrew at beekhof.net>
> To: pacemaker at oss.clusterlabs.org
> Cc: "pacemaker at oss.clusterlabs.org" <pacemaker at clusterlabs.org>
> Subject: Re: [Pacemaker] crmd killed by signal 11
> 	(pacemaker1.0.5/heartbeat3/RHEL5.3)
> 
> On Fri, Oct 9, 2009 at 7:02 AM, Li, Ling (Ling) <lli1 at alcatel-lucent.com>
> wrote:
> > Hi,
> >
> > RPMs are downloaded from
> > http://download.opensuse.org/repositories/server:/haclustering/RHEL_5/x86_64
> >
> > We have a cluster of two nodes with 17 resources configured.
> > Of those 17 resources, 4 are clones, 1 is a group, and 12 are primitives.
> > Cluster options are
> > - symmetric-cluster=true
> > - stonith-enabled=false
> > The rest are default.
> >
> >
> > xmllint -relaxng /usr/share/pacemaker/pacemaker.rng cib.xml
> > returns
> > cib.xml validates
> >
> > The cluster works fine except that crmd is sporadically killed by signal 11.
> > So far I have found the following four causes; the first is the most common.
> >
> > 1. Core was generated by `/usr/lib64/heartbeat/crmd'.
> > Program terminated with signal 11, Segmentation fault.
> > [New process 2543]
> > #0  0x0000000000428d7f in te_graph_trigger ()
> 
> I'd have expected gdb to indicate a line number here.
> Hard to know what the problem might be... do you have the logs for this
> crash?
> 
> > (gdb) bt
> > #0  0x0000000000428d7f in te_graph_trigger ()
> > #1  0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
> >     userdata=0xda327a0) at mainloop.c:53
> > #2  0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #3  0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> > #4  0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #5  0x0000000000405020 in crmd_init ()
> > #6  0x0000000000404efb in main ()
> > ---------
> > 2. Program terminated with signal 11, Segmentation fault.
> 
> This one is (now) fixed in
>    http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/2a324fe868c1
> 
> Any reason you're running with such a high debug level?
> 
> > [New process 567]
> > #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> > #1  0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
> >     b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
> > #2  0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
> >     prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0xa96b170, formatted=1)
> >     at xml.c:1175
> > #3  0x00002b71d9ac4a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
> >     msg=0xa96b170, text=0x4344a1 "Current state of the LRM") at xml.c:775
> > #4  0x000000000041b773 in do_lrm_query ()
> > #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
> > #6  0x0000000000405ad8 in do_fsa_action ()
> > #7  0x0000000000406620 in s_crmd_fsa_actions ()
> > #8  0x0000000000405f9c in s_crmd_fsa ()
> > #9  0x0000000000411b35 in crm_fsa_trigger ()
> > #10 0x00002b71d9ad4f63 in crm_trigger_dispatch (source=0xa95dd70, callback=0x411acf <crm_fsa_trigger>,
> >     userdata=0xa95dd70) at mainloop.c:53
> > #11 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #12 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> > #13 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #14 0x0000000000405020 in crmd_init ()
> > #15 0x0000000000404efb in main ()
> > -------
> > 3. Core was generated by `/usr/lib64/heartbeat/crmd'.
> 
> This isn't a "crash"; Pacemaker has encountered a situation it didn't
> expect.
> When this happens, it saves the program state (by generating a core
> file) and exits so that it can be respawned and try again.
> 
> > Program terminated with signal 6, Aborted.
> > [New process 4101]
> > #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x0000003839e30215 in raise () from /lib64/libc.so.6
> > #1  0x0000003839e31cc0 in abort () from /lib64/libc.so.6
> > #2  0x00002ba6d6c9238d in crm_abort (file=0x430f89 "election.c",
> >     function=0x430ff0 "do_election_count_vote", line=265,
> >     assert_condition=0x431138 "crm_str_eq(fsa_our_uuid, election_owner, TRUE)", do_core=1, do_fork=0)
> >     at utils.c:1375
> > #3  0x0000000000412712 in do_election_count_vote ()
> > #4  0x0000000000405ad8 in do_fsa_action ()
> > #5  0x00000000004066d4 in s_crmd_fsa_actions ()
> > #6  0x0000000000405f9c in s_crmd_fsa ()
> > #7  0x0000000000411b35 in crm_fsa_trigger ()
> > #8  0x00002ba6d6ca7f63 in crm_trigger_dispatch (source=0x143cbd70, callback=0x411acf <crm_fsa_trigger>,
> >     userdata=0x143cbd70) at mainloop.c:53
> > #9  0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #10 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> > #11 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #12 0x0000000000405020 in crmd_init ()
> > #13 0x0000000000404efb in main ()
> > ---
> > 4. Core was generated by `/usr/lib64/heartbeat/crmd'
> 
> This is the same as 2, which is now fixed.
> 
> > Program terminated with signal 11, Segmentation fault.
> > [New process 6817]
> > #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> > (gdb) bt
> > #0  0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> > #1  0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
> >     b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
> > #2  0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
> >     prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0x6b95f00, formatted=1)
> >     at xml.c:1175
> > #3  0x00002b57db361a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
> >     msg=0x6b95f00, text=0x4344a1 "Current state of the LRM") at xml.c:775
> > #4  0x000000000041b773 in do_lrm_query ()
> > #5  0x0000000000413ebc in do_cl_join_finalize_respond ()
> > #6  0x0000000000405ad8 in do_fsa_action ()
> > #7  0x0000000000406620 in s_crmd_fsa_actions ()
> > #8  0x0000000000405f9c in s_crmd_fsa ()
> > #9  0x0000000000411b35 in crm_fsa_trigger ()
> > #10 0x00002b57db371f63 in crm_trigger_dispatch (source=0x6b91d70, callback=0x411acf <crm_fsa_trigger>,
> >     userdata=0x6b91d70) at mainloop.c:53
> > #11 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> > #12 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> > #13 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> > #14 0x0000000000405020 in crmd_init ()
> > #15 0x0000000000404efb in main ()
> > ----
> >
> >
> > I ran crm_verify -VVVVV -L
> > The output contains:
> > 1. For each line of cib.xml, a message: "debug: debug2: log_data_element: get_xpath_object: Bad Input".
> > 2. Many find_xml_node messages, such as
> >    find_xml_node: Could not find operations in clone.
> >    find_xml_node: Could not find group in clone.
> >    find_xml_node: Could not find operations in primitive. (but each primitive has a monitor operation)
> >
> > 3. Warnings found during check: config may not be valid
> >
> > My questions:
> >
> > 1. Can I ignore 1 and 2, since cib.xml passed the xmllint validation?
> > 2. Which tool can I use to make sure the cib.xml is absolutely correct?
> >
> > Thanks,
> >
> > Ling Li
> >
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> 
> End of Pacemaker Digest, Vol 23, Issue 25
> *****************************************
-------------- next part --------------
A non-text attachment was scrubbed...
Name: crmdReboot.tar.gz
Type: application/x-gzip
Size: 7620435 bytes
Desc: crmdReboot.tar.gz
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20091012/574199aa/attachment-0001.bin>