[Pacemaker] Two cloned VMs, only one of the two shows online when starting corosync/pacemaker

Guillaume Chanaud guillaume.chanaud at connecting-nature.com
Thu Sep 2 20:29:26 UTC 2010


  Ok, I'll try to report this on the openais list (I tried the 1.2.7 from 
the Fedora repo before compiling from sources ;) )
> On Mon, Aug 30, 2010 at 4:34 PM, Guillaume Chanaud
> <guillaume.chanaud at connecting-nature.com>  wrote:
>>   On 27/08/2010 16:29, Andrew Beekhof wrote:
>>> On Tue, Aug 3, 2010 at 4:40 PM, Guillaume Chanaud
>>> <guillaume.chanaud at connecting-nature.com>    wrote:
>>>> Hello,
>>>> sorry for the long delay, July is not the best month to get things
>>>> working fast.
>>> Neither is August :-)
>>>
>> lol sure :)
>>>> Here is the core dump file (55 MB):
>>>> http://www.connecting-nature.com/corosync/core
>>>> The corosync version is 1.2.3.
>>> Sorry, but I can't do anything with that file.
>>> Core files are only usable on the machine they came from.
>>>
>>> you'll have to open it with gdb and type "bt" to get a backtrace.
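>>> For example, something along these lines (assuming the core file is
>>> named "core" and corosync was installed to /usr/sbin; adjust the paths
>>> for your setup):
>>>
>>>   # open the core together with the binary that produced it
>>>   gdb /usr/sbin/corosync core
>>>   (gdb) bt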
>> Sorry, I saw that after sending the last mail. In fact I tried to debug/bt it,
>> but
>> 1. I'm not a C developer (I understand a little about it...)
>> 2. I had never used gdb before, so it's hard for me to step through the corosync code
>>
>> I'm not sure the trace will be useful, but here it is:
>> Core was generated by `corosync'.
>> Program terminated with signal 6, Aborted.
>> #0  0x0000003506a329a5 in raise (sig=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>> 64      return INLINE_SYSCALL (tgkill, 3, pid, selftid, sig);
>> (gdb) bt
>> #0  0x0000003506a329a5 in raise (sig=6) at
>> ../nptl/sysdeps/unix/sysv/linux/raise.c:64
>> #1  0x0000003506a34185 in abort () at abort.c:92
>> #2  0x0000003506a2b935 in __assert_fail (assertion=0x7fce14f0b2ae
>> "token_memb_entries>= 1", file=<value optimized out>, line=1194,
>>     function=<value optimized out>) at assert.c:81
>> #3  0x00007fce14efb716 in memb_consensus_agreed (instance=0x7fce12338010) at
>> totemsrp.c:1194
>> #4  0x00007fce14f01723 in memb_join_process (instance=0x7fce12338010,
>> memb_join=0x822bf8) at totemsrp.c:3922
>> #5  0x00007fce14f01a3a in message_handler_memb_join
>> (instance=0x7fce12338010, msg=<value optimized out>, msg_len=<value
>> optimized out>,
>>     endian_conversion_needed=<value optimized out>) at totemsrp.c:4165
>> #6  0x00007fce14ef7644 in rrp_deliver_fn (context=<value optimized out>,
>> msg=0x822bf8, msg_len=420) at totemrrp.c:1404
>> #7  0x00007fce14ef6569 in net_deliver_fn (handle=<value optimized out>,
>> fd=<value optimized out>, revents=<value optimized out>, data=0x822550)
>>     at totemudp.c:1244
>> #8  0x00007fce14ef259a in poll_run (handle=2240235047305084928) at
>> coropoll.c:435
>> #9  0x0000000000405594 in main (argc=<value optimized out>, argv=<value
>> optimized out>) at main.c:1558
> Ok, definitely a corosync bug.
>
>> I tried to compile it from source (the 1.2.7 tag and svn trunk), but I'm unable
>> to backtrace it, as gdb tells me it can't find the debug info (I did a
>> ./configure --enable-debug, but gdb seems to want a
>> /usr/lib/debug/.build-id/... file matching the current executable, and I don't know
>> how to generate that)
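>> (One thing I might try, if I understand gdb correctly: run the freshly
>> built, uninstalled binary directly under gdb, so the symbols come from
>> the binary itself instead of /usr/lib/debug. The exec/ path is my guess
>> at where the build tree puts the daemon:
>>
>>   ./configure --enable-debug CFLAGS="-g -O0"
>>   make
>>   # -f keeps corosync in the foreground so gdb stays attached
>>   gdb --args exec/corosync -f
>> )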
> What about installing 1.2.7 from clusterlabs?
> If you still see it with 1.2.7, you should definitely report this to
> the openais mailing list.
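> For reference, the packages can be pulled in with a yum repo file along
> these lines (the baseurl is from memory and depends on your Fedora
> release, so double-check it against clusterlabs.org first):
>
>   cat > /etc/yum.repos.d/clusterlabs.repo <<EOF
>   [clusterlabs]
>   name=High Availability/Clustering server technologies
>   baseurl=http://www.clusterlabs.org/rpm/fedora-13
>   enabled=1
>   gpgcheck=0
>   EOF
>   yum install -y corosync pacemaker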
>
>> With the 1.2.7 version, the init script says it started correctly, but after one or
>> two seconds only the lrmd and pengine processes are still alive.
>>
>> With the trunk version, the init script fails to start (and so the processes are
>> correctly killed).
>>
>> In 1.2.7, when stepping, I'm unable to go further than
>> service.c:201        res = service->exec_init_fn (corosync_api);
>> as I think it creates a new process for the pacemaker services
>> (and I don't know how to step inside this new process to debug it)
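>> (Unless gdb's fork-following settings would help here? From the manual it
>> looks like this should make gdb stay with the child process:
>>
>>   (gdb) set follow-fork-mode child
>>   (gdb) set detach-on-fork off
>>   (gdb) run
>> )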
>>
>> If you need/want, I can give you ssh access to this VM to test/debug it.
>>
>> It may be related to other posts about "Could not connect to the CIB
>> service: connection failed" (I saw some messages describing problems more or
>> less like mine)
>>
>> I've pasted the end of the messages log here:
>> Aug 30 16:30:50 www01 crmd: [19821]: notice: ais_dispatch: Membership 208656: quorum acquired
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node www01.connecting-nature.com: id=1006676160 state=member (new) addr=r(0) ip(192.168.0.60) (
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 83929280
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=83929280 state=member (new) addr=r(0) ip(192.168.0.5) votes=0 born=0 seen=20865
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node filer2.connecting-nature.com now has id: 100706496
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node 100706496 is now known as filer2.connecting-nature.com
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node filer2.connecting-nature.com: id=100706496 state=member (new) addr=r(0) ip(192.168.0.6) vo
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_new_peer: Node <null> now has id: 1174448320
>> Aug 30 16:30:50 www01 crmd: [19821]: info: crm_update_peer: Node (null): id=1174448320 state=member (new) addr=r(0) ip(192.168.0.70) votes=0 born=0 seen=20
>> Aug 30 16:30:50 www01 crmd: [19821]: info: do_started: The local CRM is operational
>> Aug 30 16:30:50 www01 crmd: [19821]: info: do_state_transition: State transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_st
>> Aug 30 16:30:50 www01 corosync[19809]:   [TOTEM ] FAILED TO RECEIVE
>> Aug 30 16:30:51 www01 crmd: [19821]: info: ais_dispatch: Membership 208656: quorum retained
>> Aug 30 16:30:51 www01 crmd: [19821]: info: te_connect_stonith: Attempting connection to fencing daemon...
>> Aug 30 16:30:52 www01 crmd: [19821]: info: te_connect_stonith: Connected
>> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Aug 30 16:30:52 www01 cib: [19817]: ERROR: ais_dispatch: AIS connection failed
>> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Resource temporarily unavailable (11)
>> Aug 30 16:30:52 www01 cib: [19817]: ERROR: cib_ais_destroy: AIS connection terminated
>> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: ais_dispatch: AIS connection failed
>> Aug 30 16:30:52 www01 stonith-ng: [19816]: ERROR: stonith_peer_ais_destroy: AIS connection terminated
>> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
>> Aug 30 16:30:52 www01 attrd: [19819]: ERROR: ais_dispatch: AIS connection failed
>> Aug 30 16:30:52 www01 attrd: [19819]: CRIT: attrd_ais_destroy: Lost connection to OpenAIS service!
>> Aug 30 16:30:52 www01 attrd: [19819]: info: main: Exiting...
>> Aug 30 16:30:52 www01 crmd: [19821]: info: cib_native_msgready: Lost connection to the CIB service [19817].
>> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/callback].
>> Aug 30 16:30:52 www01 crmd: [19821]: CRIT: cib_native_dispatch: Lost connection to the CIB service [19817/command].
>> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crmd_cib_connection_destroy: Connection to the CIB terminated...
>> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: Receiving message body failed: (2) Library error: Invalid argument (22)
>> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: ais_dispatch: AIS connection failed
>> Aug 30 16:30:52 www01 crmd: [19821]: ERROR: crm_ais_destroy: AIS connection terminated
>>
>> The strange thing is that crmd finds the hostname for
>> filer2.connecting-nature.com (which is the DC), but sets it to <null> for all
>> the other cluster nodes.
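>> If it helps, here is how I can compare what each node reports about the
>> membership (assuming the corosync 1.x tools are the right ones to use):
>>
>>   corosync-cfgtool -s                               # ring status on this node
>>   corosync-objctl runtime.totem.pg.mrp.srp.members  # totem member list
>>   crm_mon -1                                        # pacemaker's view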
>>
>> Thanks!
>> Guillaume
>>
>>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker



