[Pacemaker] Two cloned VMs, only one of the two shows online when starting corosync/pacemaker
Guillaume Chanaud
guillaume.chanaud at connecting-nature.com
Mon Aug 30 14:34:15 UTC 2010
On 27/08/2010 16:29, Andrew Beekhof wrote:
> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
> <guillaume.chanaud at connecting-nature.com> wrote:
>> Hello,
>> sorry for the delay it took, july is not the best month to get things
>> working fast.
> Neither is august :-)
>
lol sure :)
>> Here is the core dump file (55MB) :
>> http://www.connecting-nature.com/corosync/core
>> corosync version is 1.2.3
> Sorry, but I can't do anything with that file.
> Core files are only usable on the machine they came from.
>
> you'll have to open it with gdb and type "bt" to get a backtrace.
Sorry, I saw that after sending my last mail. In fact I tried to
debug/backtrace it, but:
1. I'm not a C developer (I understand a little about it...)
2. I've never used gdb before, so it's hard to step through the corosync debugging.
I'm not sure the trace will be useful, but here it is:
Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0 0x0000003506a329a5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003506a329a5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003506a34185 in abort () at abort.c:92
#2 0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae
"token_memb_entries >= 1", file=<value optimized out>, line=1194,
function=<value optimized out>) at assert.c:81
#3 0x00007fce14efb716 in memb_consensus_agreed
(instance=0x7fce12338010) at totemsrp.c:1194
#4 0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010,
memb_join=0x822bf8) at totemsrp.c:3922
#5 0x00007fce14f01a3a in message_handler_memb_join
(instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value
optimized out>,
endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
#6 0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>,
msg=0x822bf8, msg_len=420) at totemrrp.c:1404
#7 0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>,
fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
at totemudp.c:1244
#8 0x00007fce14ef259a in poll_run (handle=2240235047305084928) at
coropoll.c:435
#9 0x0000000000405594 in main (argc=<value optimized out>, argv=<value
optimized out>) at main.c:1558
I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm
unable to backtrace it, as gdb tells me it can't find the debuginfo (I did
a ./configure --enable-debug, but gdb seems to need a
/usr/lib/debug/.build-id/... entry related to the current executable, and
I don't know how to generate this).
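If I understand correctly, running gdb against the unstripped binary from
the build tree together with the core file should avoid the need for the
split /usr/lib/debug files; the binary path below is only a guess from my
build tree, so adjust it to wherever make put the executable:

    # assuming an unstripped corosync built with ./configure --enable-debug;
    # the path to the binary and the core file are just placeholders
    $ gdb ./exec/corosync /path/to/core
    (gdb) bt full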
On the 1.2.7 version, the init script says it started correctly, but after
one or two seconds only the lrmd and pengine processes are still alive.
On the trunk version, the init script fails to start (and so the processes
are correctly killed).
In 1.2.7, when I'm stepping, I'm unable to go further than
service.c:201 res = service->exec_init_fn (corosync_api);
as I think it creates a new process for the pacemaker services
(I don't know how to step inside this new process and debug it).
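Maybe gdb's fork-following settings are what I'm missing here; if I read
the gdb documentation correctly, something like this should make gdb
follow the forked child instead of staying on corosync itself (not tried
yet, just a sketch):

    # follow the forked child rather than the corosync parent
    (gdb) set follow-fork-mode child
    # keep the parent attached too instead of detaching from it
    (gdb) set detach-on-fork off
    (gdb) continue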
If you need/want, I can give you ssh access to this VM to test/debug it.
It may be related to other posts about "Could not connect to the CIB
service: connection failed" (I saw some messages describing problems more
or less like mine).
I've put the end of the messages log here:
Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership
208656: quorum acquired
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0)
ip(192.168.0.60) (
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now
has id: 83929280
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0
seen=20865
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node
filer2.connecting-nature.com now has id: 100706496
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496
is now known as filer2.connecting-nature.com
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0)
ip(192.168.0.6) vo
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now
has id: 1174448320
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0
born=0 seen=20
Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is
operational
Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING
cause=C_FSA_INTERNAL origin=do_st
Aug 30 16:30:50 www01 corosync[19809]: [TOTEM ] FAILED TO RECEIVE
Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership
208656: quorum retained
Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith:
Attempting connection to fencing daemon...
Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Resource temporarily unavailable
(11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection
failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch:
Receiving message body failed: (2) Library error: Resource temporarily
unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS
connection terminated
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS
connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR:
stonith_peer_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS
connection failed
Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost
connection to OpenAIS service!
Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost
connection to the CIB service [19817].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
connection to the CIB service [19817/callback].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
connection to the CIB service [19817/command].
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy:
Connection to the CIB terminated...
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection
failed
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS
connection terminated
The strange thing is that crmd finds the hostname for
filer2.connecting-nature.com (which is the DC), but sets it to <null> for
all the other cluster nodes.
Thanks!
Guillaume