[Pacemaker] Two cloned VMs, only one of the two shows online when starting corosync/pacemaker
Guillaume Chanaud
guillaume.chanaud at connecting-nature.com
Mon Aug 30 14:34:15 UTC 2010
On 27/08/2010 16:29, Andrew Beekhof wrote:
> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
> <guillaume.chanaud at connecting-nature.com> wrote:
>> Hello,
>> sorry for the delay it took, july is not the best month to get things
>> working fast.
> Neither is august :-)
>
lol sure :)
>> Here is the core dump file (55MB) :
>> http://www.connecting-nature.com/corosync/core
>> corosync version is 1.2.3
> Sorry, but I can't do anything with that file.
> Core files are only usable on the machine they came from.
>
> you'll have to open it with gdb and type "bt" to get a backtrace.
Sorry, I saw that after sending my last mail. In fact I tried to
debug/backtrace it, but:
1. I'm not a C developer (I understand a little about it...)
2. I've never used gdb before, so it's hard to step through the corosync debugging.
I'm not sure the trace will be useful, but here it is:
Core was generated by `corosync'.
Program terminated with signal 6, Aborted.
#0 0x0000003506a329a5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
(gdb) bt
#0 0x0000003506a329a5 in raise (sig=6) at
../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x0000003506a34185 in abort () at abort.c:92
#2 0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae
"token_memb_entries >= 1", file=<value optimized out>, line=1194,
function=<value optimized out>) at assert.c:81
#3 0x00007fce14efb716 in memb_consensus_agreed
(instance=0x7fce12338010) at totemsrp.c:1194
#4 0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010,
memb_join=0x822bf8) at totemsrp.c:3922
#5 0x00007fce14f01a3a in message_handler_memb_join
(instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value
optimized out>,
endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
#6 0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>,
msg=0x822bf8, msg_len=420) at totemrrp.c:1404
#7 0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>,
fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
at totemudp.c:1244
#8 0x00007fce14ef259a in poll_run (handle=2240235047305084928) at
coropoll.c:435
#9 0x0000000000405594 in main (argc=<value optimized out>, argv=<value
optimized out>) at main.c:1558
I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm
unable to backtrace it, as gdb tells me it can't find the debuginfo (I did
a ./configure --enable-debug, but gdb seems to need a
/usr/lib/debug/.build-id/... entry related to the current executable, and
I don't know how to generate this).
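If I understand correctly, running gdb against the unstripped binary from
the build tree together with the core file should avoid the need for the
split /usr/lib/debug files; the binary path below is only a guess from my
build tree, so adjust it to wherever make put the executable:

    # assuming an unstripped corosync built with ./configure --enable-debug;
    # the path to the binary and the core file are just placeholders
    $ gdb ./exec/corosync /path/to/core
    (gdb) bt full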
On the 1.2.7 version, the init script says it started correctly, but after
one or two seconds only the lrmd and pengine processes are still alive.
On the trunk version, the init script fails to start (and so the processes
are correctly killed).
In 1.2.7, when I'm stepping, I'm unable to go further than
service.c:201 res = service->exec_init_fn (corosync_api);
as I think it creates a new process for the pacemaker services
(I don't know how to step inside this new process and debug it).
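Maybe gdb's fork-following settings are what I'm missing here; if I read
the gdb documentation correctly, something like this should make gdb
follow the forked child instead of staying on corosync itself (not tried
yet, just a sketch):

    # follow the forked child rather than the corosync parent
    (gdb) set follow-fork-mode child
    # keep the parent attached too instead of detaching from it
    (gdb) set detach-on-fork off
    (gdb) continue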
If you need/want, I can give you ssh access to this VM to test/debug it.
It may be related to other posts about "Could not connect to the CIB
service: connection failed" (I saw some messages describing problems more
or less like mine).
I've put the end of the messages log here:
Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership
208656: quorum acquired
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0)
ip(192.168.0.60) (
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now
has id: 83929280
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0
seen=20865
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node
filer2.connecting-nature.com now has id: 100706496
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496
is now known as filer2.connecting-nature.com
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0)
ip(192.168.0.6) vo
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now
has id: 1174448320
Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0
born=0 seen=20
Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is
operational
Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State
transition S_STARTING -> S_PENDING [ input=I_PENDING
cause=C_FSA_INTERNAL origin=do_st
Aug 30 16:30:50 www01 corosync[19809]: [TOTEM ] FAILED TO RECEIVE
Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership
208656: quorum retained
Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith:
Attempting connection to fencing daemon...
Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Resource temporarily unavailable
(11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection
failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch:
Receiving message body failed: (2) Library error: Resource temporarily
unavailable (11)
Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS
connection terminated
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS
connection failed
Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR:
stonith_peer_ais_destroy: AIS connection terminated
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS
connection failed
Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost
connection to OpenAIS service!
Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost
connection to the CIB service [19817].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
connection to the CIB service [19817/callback].
Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
connection to the CIB service [19817/command].
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy:
Connection to the CIB terminated...
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving
message body failed: (2) Library error: Invalid argument (22)
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection
failed
Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS
connection terminated
The strange thing is that crmd finds the hostname for
filer2.connecting-nature.com (which is the DC), but sets it to <null> for
all the other cluster nodes.
Thanks!
Guillaume