[Pacemaker] Two cloned VMs, only one of the two shows online when starting corosync/pacemaker

Andrew Beekhof andrew at beekhof.net
Thu Sep 2 06:52:56 UTC 2010


On Mon, Aug 30, 2010 at 4:34 PM, Guillaume Chanaud
<guillaume.chanaud at connecting-nature.com> wrote:
>  On 27/08/2010 16:29, Andrew Beekhof wrote:
>>
>> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
>> <guillaume.chanaud at connecting-nature.com>  wrote:
>>>
>>> Hello,
>>> sorry for the delay; July is not the best month to get things done quickly.
>>
>> Neither is august :-)
>>
> lol sure :)
>>>
>>> Here is the core dump file (55 MB):
>>> http://www.connecting-nature.com/corosync/core
>>> The corosync version is 1.2.3.
>>
>> Sorry, but I can't do anything with that file.
>> Core files are only usable on the machine they came from.
>>
>> You'll have to open it with gdb and type "bt" to get a backtrace.
>
> Sorry, I saw that after sending the last mail. In fact I tried to
> debug/backtrace it, but:
> 1. I'm not a C developer (I understand a little about it...)
> 2. I've never used gdb before, so it's hard to step through the corosync
> debugging
>
> I'm not sure the trace will be useful, but here it is:
> Core was generated by `corosync'.
> Program terminated with signal 6, Aborted.
> #0  0x0000003506a329a5 in raise (sig=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> 64      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
> (gdb) bt
> #0  0x0000003506a329a5 in raise (sig=6) at
> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
> #1  0x0000003506a34185 in abort () at abort.c:92
> #2  0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae
> "token_memb_entries >= 1", file=<value optimized out>, line=1194,
>    function=<value optimized out>) at assert.c:81
> #3  0x00007fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at
> totemsrp.c:1194
> #4  0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010,
> memb_join=0x822bf8) at totemsrp.c:3922
> #5  0x00007fce14f01a3a in message_handler_memb_join
> (instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value
> optimized out>,
>    endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
> #6  0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>,
> msg=0x822bf8, msg_len=420) at totemrrp.c:1404
> #7  0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>,
> fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
>    at totemudp.c:1244
> #8  0x00007fce14ef259a in poll_run (handle=2240235047305084928) at
> coropoll.c:435
> #9  0x0000000000405594 in main (argc=<value optimized out>, argv=<value
> optimized out>) at main.c:1558

Ok, definitely a corosync bug.

>
> I tried to compile it from source (the 1.2.7 tag and svn trunk) but I'm
> unable to backtrace it, as gdb tells me it can't find the debuginfo (I did a
> ./configure --enable-debug, but gdb seems to need a
> /usr/lib/debug/.build-id/... entry for the current executable, and I don't
> know how to generate it)

What about installing 1.2.7 from clusterlabs?
If you still see it with 1.2.7, you should definitely report this to
the openais mailing list.
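
About the missing debuginfo: the /usr/lib/debug/.build-id lookup is normally
only needed for stripped distro binaries. If you point gdb at the unstripped
binary from your own --enable-debug build, plus the core that binary
produced, the symbols are in the binary itself. Roughly (just the general
idea, paths below are placeholders for your build tree):

   # build with debug symbols from the source tree
   ./configure --enable-debug
   make

   # allow core files, then reproduce the crash with that binary
   ulimit -c unlimited
   ...run corosync as usual...

   # point gdb at the unstripped binary plus the core it produced
   gdb /path/to/your/built/corosync /path/to/core
   (gdb) bt full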

> With the 1.2.7 version, the init script says it started correctly, but after
> one or two seconds only the lrmd and pengine processes are still alive
>
> With the trunk version, the init script fails to start (and so the processes
> are correctly killed)
>
> With 1.2.7, when I'm stepping, I'm unable to go further than
> service.c:201        res = service->exec_init_fn (corosync_api);
> as I think it creates a new process for the pacemaker services
> (I don't know how to step inside this new process and debug it)
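
gdb can be told to follow the child instead of the parent across the fork; a
rough sketch, not tested against this particular code path:

   (gdb) set follow-fork-mode child
   (gdb) run

or, probably simpler, attach to the child once it is running, from a second
terminal (the pid below is a placeholder):

   gdb -p <pid of the cib/crmd process>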
>
> If you need/want, I'll give you ssh access to this VM to test/debug it.
>
> It's probably related to other posts about "Could not connect to the CIB
> service: connection failed" (I saw some messages about things more or less
> like my problem)
>
> Here is the end of the messages log again:
> Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership
> 208656: quorum acquired
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
> www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0)
> ip(192.168.0.60)  (
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has
> id: 83929280
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
> id=83929280 state=member (new) addr=r(0) ip(192.168.0.5)  votes=0 born=0
> seen=20865
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node
> filer2.connecting-nature.com now has id: 100706496
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is
> now known as filer2.connecting-nature.com
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node
> filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0)
> ip(192.168.0.6)  vo
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has
> id: 1174448320
> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null):
> id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70)  votes=0 born=0
> seen=20
> Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is
> operational
> Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_st
> Aug 30 16:30:50 www01 corosync[19809]:   [TOTEM ] FAILED TO RECEIVE
> Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656:
> quorum retained
> Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting
> connection to fencing daemon...
> Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message
> body failed: (2) Library error: Resource temporarily unavailable (11)
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection
> failed
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: Receiving
> message body failed: (2) Library error: Resource temporarily unavailable
> (11)
> Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS connection
> terminated
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS
> connection failed
> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: stonith_peer_ais_destroy:
> AIS connection terminated
> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving message
> body failed: (2) Library error: Invalid argument (22)
> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS connection
> failed
> Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost
> connection to OpenAIS service!
> Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
> Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost
> connection to the CIB service [19817].
> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
> connection to the CIB service [19817/callback].
> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost
> connection to the CIB service [19817/command].
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy:
> Connection to the CIB terminated...
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving message
> body failed: (2) Library error: Invalid argument (22)
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection
> failed
> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS connection
> terminated
>
> The strange thing is that crmd finds the hostname for
> filer2.connecting-nature.com (which is the DC), but sets it to <null> for
> all the other cluster nodes
>
> Thanks !
> Guillaume
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>


