[Pacemaker] crmd killed by signal 11 (pacemaker1.0.5/heartbeat3/RHEL5.3)
Li, Ling (Ling)
lli1 at alcatel-lucent.com
Fri Oct 9 01:02:57 EDT 2009
Hi,
The RPMs were downloaded from
http://download.opensuse.org/repositories/server:/haclustering/RHEL_5/x86_64
We have a two-node cluster with 17 resources configured.
Of those 17 resources, 4 are clones, 1 is a group, and 12 are primitives.
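As a cross-check on the counts above, here is a minimal Python sketch that tallies the top-level resources in a CIB file. It assumes the resources sit directly under configuration/resources as primitive, group, clone, or master elements. The function name and the sample XML are illustrative only and are not taken from the actual cluster.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_top_level_resources(cib_xml):
    """Tally the direct children of <resources> by tag name."""
    root = ET.fromstring(cib_xml)
    # Assumption: the root element is <cib> and the resources live at
    # configuration/resources relative to it.
    resources = root.find("configuration/resources")
    if resources is None:
        return Counter()
    # Only direct children are counted, so a primitive nested inside a
    # group or clone is not counted as a separate top-level resource.
    return Counter(child.tag for child in resources)


# Hypothetical, heavily trimmed CIB used only to exercise the function.
SAMPLE_CIB = """
<cib>
  <configuration>
    <resources>
      <primitive id="p1" class="ocf" provider="heartbeat" type="Dummy"/>
      <group id="g1">
        <primitive id="p2" class="ocf" provider="heartbeat" type="Dummy"/>
      </group>
      <clone id="c1">
        <primitive id="p3" class="ocf" provider="heartbeat" type="Dummy"/>
      </clone>
    </resources>
  </configuration>
</cib>
"""

if __name__ == "__main__":
    print(dict(count_top_level_resources(SAMPLE_CIB)))
```

To run it against a real file, the text of cib.xml would be read and passed in instead of SAMPLE_CIB.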
The cluster options are:
symmetric-cluster=true
stonith-enabled=false
All other options are left at their defaults.
xmllint -relaxng /usr/share/pacemaker/pacemaker.rng cib.xml
returns
cib.xml validates
The cluster works fine, except that crmd is sporadically killed by signal 11.
So far I have collected the following four backtraces (the third is an abort, signal 6, rather than a segmentation fault). The first one is the most common.
1. Core was generated by `/usr/lib64/heartbeat/crmd'.
Program terminated with signal 11, Segmentation fault.
[New process 2543]
#0 0x0000000000428d7f in te_graph_trigger ()
(gdb) bt
#0 0x0000000000428d7f in te_graph_trigger ()
#1 0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
userdata=0xda327a0) at mainloop.c:53
#2 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#3 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
#4 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#5 0x0000000000405020 in crmd_init ()
#6 0x0000000000404efb in main ()
---------
2. Program terminated with signal 11, Segmentation fault.
[New process 567]
#0 0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
#1 0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
#2 0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0xa96b170, formatted=1)
at xml.c:1175
#3 0x00002b71d9ac4a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
msg=0xa96b170, text=0x4344a1 "Current state of the LRM") at xml.c:775
#4 0x000000000041b773 in do_lrm_query ()
#5 0x0000000000413ebc in do_cl_join_finalize_respond ()
#6 0x0000000000405ad8 in do_fsa_action ()
#7 0x0000000000406620 in s_crmd_fsa_actions ()
#8 0x0000000000405f9c in s_crmd_fsa ()
#9 0x0000000000411b35 in crm_fsa_trigger ()
#10 0x00002b71d9ad4f63 in crm_trigger_dispatch (source=0xa95dd70, callback=0x411acf <crm_fsa_trigger>,
userdata=0xa95dd70) at mainloop.c:53
#11 0x0000003e8dc2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0
#13 0x0000003e8dc2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#14 0x0000000000405020 in crmd_init ()
#15 0x0000000000404efb in main ()
-------
3. Core was generated by `/usr/lib64/heartbeat/crmd'.
Program terminated with signal 6, Aborted.
[New process 4101]
#0 0x0000003839e30215 in raise () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003839e30215 in raise () from /lib64/libc.so.6
#1 0x0000003839e31cc0 in abort () from /lib64/libc.so.6
#2 0x00002ba6d6c9238d in crm_abort (file=0x430f89 "election.c",
function=0x430ff0 "do_election_count_vote", line=265,
assert_condition=0x431138 "crm_str_eq(fsa_our_uuid, election_owner, TRUE)", do_core=1, do_fork=0)
at utils.c:1375
#3 0x0000000000412712 in do_election_count_vote ()
#4 0x0000000000405ad8 in do_fsa_action ()
#5 0x00000000004066d4 in s_crmd_fsa_actions ()
#6 0x0000000000405f9c in s_crmd_fsa ()
#7 0x0000000000411b35 in crm_fsa_trigger ()
#8 0x00002ba6d6ca7f63 in crm_trigger_dispatch (source=0x143cbd70, callback=0x411acf <crm_fsa_trigger>,
userdata=0x143cbd70) at mainloop.c:53
#9 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#10 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
#11 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#12 0x0000000000405020 in crmd_init ()
#13 0x0000000000404efb in main ()
---
4. Core was generated by `/usr/lib64/heartbeat/crmd'
Program terminated with signal 11, Segmentation fault.
[New process 6817]
#0 0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
(gdb) bt
#0 0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
#1 0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
#2 0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
prefix=0x4344a1 "Current state of the LRM", log_level=9, depth=0, data=0x6b95f00, formatted=1)
at xml.c:1175
#3 0x00002b57db361a44 in print_xml_formatted (log_level=9, function=0x43444f "do_lrm_query",
msg=0x6b95f00, text=0x4344a1 "Current state of the LRM") at xml.c:775
#4 0x000000000041b773 in do_lrm_query ()
#5 0x0000000000413ebc in do_cl_join_finalize_respond ()
#6 0x0000000000405ad8 in do_fsa_action ()
#7 0x0000000000406620 in s_crmd_fsa_actions ()
#8 0x0000000000405f9c in s_crmd_fsa ()
#9 0x0000000000411b35 in crm_fsa_trigger ()
#10 0x00002b57db371f63 in crm_trigger_dispatch (source=0x6b91d70, callback=0x411acf <crm_fsa_trigger>,
userdata=0x6b91d70) at mainloop.c:53
#11 0x000000383be2cdb4 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
#12 0x000000383be2fc0d in ?? () from /lib64/libglib-2.0.so.0
#13 0x000000383be2ff1a in g_main_loop_run () from /lib64/libglib-2.0.so.0
#14 0x0000000000405020 in crmd_init ()
#15 0x0000000000404efb in main ()
----
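Backtraces 2 and 4 above show the same chain of function names and differ only in addresses. A minimal Python sketch of how such traces could be compared automatically is given below. It reduces each trace to a tuple of function names. The regular expression assumes the gdb frame layout shown in this message, and the names FRAME_RE and backtrace_signature are hypothetical helpers, not part of gdb or Pacemaker.

```python
import re

# Matches the first line of a gdb frame, for example
#   "#1 0x00002b71d9ac0bf7 in crm_str_eq (a=..."
#   "#3 0x0000003e8dc2fc0d in ?? () from /lib64/libglib-2.0.so.0"
FRAME_RE = re.compile(r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)\s*\(")


def backtrace_signature(bt_text):
    """Return the function names of a backtrace, innermost frame first."""
    frames = {}
    for line in bt_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            # Keyed by frame number, so frame #0 printed twice by gdb
            # (once before "bt", once inside it) is stored only once.
            frames[int(match.group(1))] = match.group(2)
    return tuple(frames[n] for n in sorted(frames))


# Excerpts (first frames only) of traces 1, 2 and 4 from this message.
BT_CASE_1 = """\
#0 0x0000000000428d7f in te_graph_trigger ()
(gdb) bt
#0 0x0000000000428d7f in te_graph_trigger ()
#1 0x00002ab6d4a3df63 in crm_trigger_dispatch (source=0xda327a0, callback=0x428d34 <te_graph_trigger>,
userdata=0xda327a0) at mainloop.c:53
"""

BT_CASE_2 = """\
#0 0x0000003e8bc7b2f0 in strcasecmp () from /lib64/libc.so.6
#1 0x00002b71d9ac0bf7 in crm_str_eq (a=0x2b71d9ad85fb "__name__",
b=0x22296c6c <Address 0x22296c6c out of bounds>, use_case=0) at utils.c:1848
#2 0x00002b71d9ac60e4 in log_data_element (function=0x43444f "do_lrm_query",
"""

BT_CASE_4 = """\
#0 0x0000003839e7b2f0 in strcasecmp () from /lib64/libc.so.6
#1 0x00002b57db35dbf7 in crm_str_eq (a=0x2b57db3755fb "__name__",
b=0xb96ac02022296c6c <Address 0xb96ac02022296c6c out of bounds>, use_case=0) at utils.c:1848
#2 0x00002b57db3630e4 in log_data_element (function=0x43444f "do_lrm_query",
"""

if __name__ == "__main__":
    for name, text in (("1", BT_CASE_1), ("2", BT_CASE_2), ("4", BT_CASE_4)):
        print(name, backtrace_signature(text))
```

Comparing by function name rather than by address is a deliberate choice here: the shared-library load addresses differ between the core files, so the raw frame lines never match even when the code path is the same.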
I also ran crm_verify -VVVVV -L.
The output contains:
1. For each line of cib.xml, a message: "debug: debug2: log_data_element: get_xpath_object: Bad Input".
2. Many find_xml_node messages such as
find_xml_node: Could not find operations in clone.
find_xml_node: Could not find group in clone.
find_xml_node: Could not find operations in primitive. (but each primitive has a monitor operation)
3. Warnings found during check: config may not be valid
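Since the find_xml_node messages are numerous, it may help to tally them by the element that was looked for and the element it was looked for in. The following is a minimal Python sketch under the assumption that every such message has the form quoted above. The function name and the sample lines are illustrative, and real crm_verify output may carry additional prefixes, which the search tolerates.

```python
import re
from collections import Counter

# Assumed message shape, taken from the lines quoted in this message:
#   "find_xml_node: Could not find <child> in <parent>."
MISSING_RE = re.compile(
    r"find_xml_node: Could not find (\w+) in (\w+)", re.IGNORECASE
)


def tally_missing_nodes(log_text):
    """Count (child, parent) pairs reported by find_xml_node."""
    pairs = Counter()
    for line in log_text.splitlines():
        match = MISSING_RE.search(line)
        if match:
            pairs[(match.group(1), match.group(2))] += 1
    return pairs


# Sample input built from the three messages quoted above.
SAMPLE_LOG = """\
find_xml_node: Could not find operations in clone.
find_xml_node: Could not find group in clone.
find_xml_node: Could not find operations in primitive.
"""

if __name__ == "__main__":
    for (child, parent), count in sorted(tally_missing_nodes(SAMPLE_LOG).items()):
        print(f"{count:4d}  {child} missing in {parent}")
```

In practice the saved output of crm_verify would be read from a file and passed in place of SAMPLE_LOG.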
My questions:
1. Can I ignore items 1 and 2 of the crm_verify output, since cib.xml passed the xmllint validation?
2. Which tool can I use to make sure that cib.xml is absolutely correct?
Thanks,
Ling Li