[Pacemaker] crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)
Andrew Beekhof
andrew at beekhof.net
Sun Oct 11 15:26:47 EDT 2009
On Fri, Oct 9, 2009 at 7:02 AM, Li, Ling (Ling) <lli1 at alcatel-lucent.com> wrote:
> Hi,
>
> rpms are downloaded from
> http://download.opensuse.org/repositories/server:/haclustering/RHEL_5/x86_64
>
> We have a cluster of two nodes with 17 resources configured.
> Among those 17 resources, 4 clones, 1 group, and 12 primitives.
> Cluster options are
> symmetric-cluster=true
> stonith-enabled=false
> The rest are default.
>
>
> xmllint -relaxng /usr/share/pacemaker/pacemaker.rng cib.xml
> returns
> cib.xml validates
>
> The cluster works fine except that crmd is killed by signal 11 sporadically.
> So far I have the following four causes. The first one is the most common one.
>
> 1. Core was generated by `/usr/lib64/heartbeat/crmd'.
> Program terminated with signal 11, Segmentation fault.
> [New process 2543]
> #0 0x0000000000428d7f in te_graph_trigger ()
I'd have expected gdb to indicate a line number here.
Hard to know what the problem might be... do you have the logs for this crash?
> (gdb) bt
> #0 0x0000000000428d7f in te_graph_trigger ()
> #1 0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
> userdata=0xda327a0) at mainloop.c:53
> #2 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #3 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #4 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #5 0x0000000000405020 in crmd_init ()
> #6 0x0000000000404efb in main ()
> ---------
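For what it's worth, the trace does narrow things down a little even without line numbers: frame #1 gives the callback's entry address (0x428d34) and frame #0 gives the faulting address (0x428d7f), so the fault is 0x4b (75) bytes into te_graph_trigger itself rather than in something it called. A small sketch of that arithmetic; the helper name is made up for illustration:

```c
#include <stdint.h>

/* Hypothetical helper: byte offset of a faulting program counter
 * from the entry address of the function that contains it. */
static uint64_t offset_in_function(uint64_t pc, uint64_t function_start)
{
    return pc - function_start;
}

/* From the backtrace above:
 *   offset_in_function(0x428d7f, 0x428d34) is 0x4b, i.e. 75 bytes. */
```

With that offset in hand, `disassemble te_graph_trigger` in gdb should show the faulting instruction even though no line information is available.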
> 2. Program terminated with signal 11, Segmentation fault.
This one is (now) fixed in
http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/2a324fe868c1
Any reason you're running with such a high debug level? (The trace below
shows the crash in a log_level=9 logging call.)
> [New process 567]
> #0 0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> (gdb) bt
> #0 0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
> #1 0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
> b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
> #2 0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
> prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0xa96b170, formatted=1)
> at xml.c:1175
> #3 0x00002b71d9ac4a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
> msg=0xa96b170, text=0x4344a1 "Current state of the LRM") at xml.c:775
> #4 0x000000000041b773 in do_lrm_query ()
> #5 0x0000000000413ebc in do_cl_join_finalize_respond ()
> #6 0x0000000000405ad8 in do_fsa_action ()
> #7 0x0000000000406620 in s_crmd_fsa_actions ()
> #8 0x0000000000405f9c in s_crmd_fsa ()
> #9 0x0000000000411b35 in crm_fsa_trigger ()
> #10 0x00002b71d9ad4f63 in crm_trigger_dispatch (source=0xa95dd70, callback=0x411acf <crm_fsa_trigger>,
> userdata=0xa95dd70) at mainloop.c:53
> #11 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #13 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #14 0x0000000000405020 in crmd_init ()
> #15 0x0000000000404efb in main ()
> -------
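The trace also shows why a plain NULL check would not have caught this: `b` is non-NULL but points outside mapped memory, and with use_case=0 the comparison ends up in strcasecmp(), which faults as soon as it dereferences the pointer. A minimal sketch of a NULL-tolerant comparison of this kind follows. It is an illustration only, not the code from utils.c, and the name str_eq_sketch is hypothetical:

```c
#include <stdbool.h>
#include <string.h>
#include <strings.h>

/* Hypothetical stand-in for a NULL-tolerant string comparison.
 * It guards against NULL, but it cannot detect a dangling or
 * corrupted pointer: such a pointer passes the NULL check and
 * the crash happens inside strcmp()/strcasecmp(). */
static bool str_eq_sketch(const char *a, const char *b, bool use_case)
{
    if (a == NULL || b == NULL) {
        return false;           /* nothing to compare */
    }
    if (a == b) {
        return true;            /* same pointer, same string */
    }
    if (use_case) {
        return strcmp(a, b) == 0;       /* case-sensitive */
    }
    return strcasecmp(a, b) == 0;       /* case-insensitive */
}
```

So the comparison helper is only where the problem surfaces; the bad pointer was handed to it by the caller.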
> 3. Core was generated by `/usr/lib64/heartbeat/crmd'.
This isn't a "crash" as such: Pacemaker has encountered a situation it didn't
expect (here, the assertion at election.c:265 failed). When this happens, it
saves the program state (by generating a core file) and exits so that it can
be respawned and try again.
> Program terminated with signal 6, Aborted.
> [New process 4101]
> #0 0x0000003839e30215 in raise () from /lib64/libc.so.6
> (gdb) bt
> #0 0x0000003839e30215 in raise () from /lib64/libc.so.6
> #1 0x0000003839e31cc0 in abort () from /lib64/libc.so.6
> #2 0x00002ba6d6c9238d in crm_abort (file=0x430f89 "election.c",
> function=0x430ff0 "do_election_count_vote", line=265,
> assert_condition=0x431138 "crm_str_eq(fsa_our_uuid, election_owner, TRUE)", do_core=1, do_fork=0)
> at utils.c:1375
> #3 0x0000000000412712 in do_election_count_vote ()
> #4 0x0000000000405ad8 in do_fsa_action ()
> #5 0x00000000004066d4 in s_crmd_fsa_actions ()
> #6 0x0000000000405f9c in s_crmd_fsa ()
> #7 0x0000000000411b35 in crm_fsa_trigger ()
> #8 0x00002ba6d6ca7f63 in crm_trigger_dispatch (source=0x143cbd70, callback=0x411acf <crm_fsa_trigger>,
> userdata=0x143cbd70) at mainloop.c:53
> #9 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #10 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #11 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #12 0x0000000000405020 in crmd_init ()
> #13 0x0000000000404efb in main ()
> ---
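The pattern behind a signal 6 core like this one (check an invariant, log what failed, abort so that a core can be written, and let the parent start the process again) can be sketched roughly as below. All names are hypothetical and this is not crm_abort() itself. The child process stands in for crmd, and the parent only reports which signal ended it:

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: report the failed condition, then abort().
 * abort() raises SIGABRT (signal 6); whether a core file is actually
 * written depends on the process's core size limit. */
static void abort_on_failure(const char *file, const char *function,
                             int line, const char *condition)
{
    fprintf(stderr, "%s: assertion failed at %s:%d: %s\n",
            function, file, line, condition);
    abort();
}

#define SKETCH_ASSERT(expr)                                              \
    do {                                                                 \
        if (!(expr)) {                                                   \
            abort_on_failure(__FILE__, __func__, __LINE__, #expr);       \
        }                                                                \
    } while (0)

/* Runs the check in a child process. Returns the signal that killed
 * the child, 0 if it exited normally, or -1 if fork/waitpid failed. */
static int signal_from_check(int condition_holds)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        SKETCH_ASSERT(condition_holds);  /* aborts the child if false */
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A supervising parent that sees SIGABRT from signal_from_check(0) would know the child gave up deliberately and can start a fresh one, which is the respawn step.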
> 4. Core was generated by `/usr/lib64/heartbeat/crmd'
This is the same as 2, which is now fixed.
> Program terminated with signal 11, Segmentation fault.
> [New process 6817]
> #0 0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> (gdb) bt
> #0 0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
> #1 0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
> b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
> #2 0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
> prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0x6b95f00, formatted=1)
> at xml.c:1175
> #3 0x00002b57db361a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
> msg=0x6b95f00, text=0x4344a1 "Current state of the LRM") at xml.c:775
> #4 0x000000000041b773 in do_lrm_query ()
> #5 0x0000000000413ebc in do_cl_join_finalize_respond ()
> #6 0x0000000000405ad8 in do_fsa_action ()
> #7 0x0000000000406620 in s_crmd_fsa_actions ()
> #8 0x0000000000405f9c in s_crmd_fsa ()
> #9 0x0000000000411b35 in crm_fsa_trigger ()
> #10 0x00002b57db371f63 in crm_trigger_dispatch (source=0x6b91d70, callback=0x411acf <crm_fsa_trigger>,
> userdata=0x6b91d70) at mainloop.c:53
> #11 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> #12 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
> #13 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #14 0x0000000000405020 in crmd_init ()
> #15 0x0000000000404efb in main ()
> ----
>
>
> I ran crm_verify -VVVVV -L
> The output has
> 1. For each line of cib.xml there is a message: "debug: debug2: log_data_element: get_xpath_object: Bad Input".
> 2. Many find_xml_node messages such as
> find_xml_node: Could not find operations in clone.
> find_xml_node: Could not find group in clone.
> find_xml_node: Could not find operations in primitive. (but each primitive has a monitor operation)
>
> 3. Warnings found during check: config may not be valid
>
> My questions:
>
> 1. Can I ignore 1 and 2 since cib.xml passed the xmllint validation?
> 2. Which tool can I use to make sure the cib.xml is absolutely correct?
>
> Thanks,
>
> Ling Li