[ClusterLabs] cib state is now lost
Ken Gaillot
kgaillot at redhat.com
Wed Aug 12 23:07:15 UTC 2015
On 08/12/2015 05:29 AM, David Neudorfer wrote:
> Thanks Ken,
>
> We're currently using Pacemaker 1.1.11 and at the moment it's not an option
> to upgrade.
> I've spun up and down these boxes on AWS and even tried different sizes. I
> think a recent upgrade broke this deploy.
What OS distribution/version are you using?
If you have the option of switching from corosync 1+plugin to either
corosync 1+CMAN or corosync 2, that should avoid the issue and put you
in a better-supported position going forward. The plugin code has known
memory issues when nodes come and go, and the effects can be unpredictable.
> This is the output from dmesg:
>
> cib[16656] general protection ip:7f45391e9545 sp:7ffddf16c8b8 error:0 in
> libc-2.12.so[7f45390be000+18a000]
> cib[16659] general protection ip:7fa36fa89545 sp:7ffe28416288 error:0 in
> libc-2.12.so[7fa36f95e000+18a000]
> cib[16663] general protection ip:7fa3defce545 sp:7ffeb5b29c58 error:0 in
> libc-2.12.so[7fa3deea3000+18a000]
> cib[16666] general protection ip:7fa1cefe4545 sp:7ffcc4b9c778 error:0 in
> libc-2.12.so[7fa1ceeb9000+18a000]
> cib[16669] general protection ip:7f4b3900f545 sp:7ffdcd65aaf8 error:0 in
> libc-2.12.so[7f4b38ee4000+18a000]
> cib[16672] general protection ip:7fc38be2b545 sp:7fffbc7e1598 error:0 in
> libc-2.12.so[7fc38bd00000+18a000]
> cib[16675] general protection ip:7f9c6890c545 sp:7ffca09539f8 error:0 in
> libc-2.12.so[7f9c687e1000+18a000]
> cib[16678] general protection ip:7f1c636ad545 sp:7ffc677d2008 error:0 in
> libc-2.12.so[7f1c63582000+18a000]
> cib[16681] general protection ip:7fed0b47e545 sp:7ffd051f0618 error:0 in
> libc-2.12.so[7fed0b353000+18a000]
> cib[16684] general protection ip:7f2ee87cd545 sp:7fff8d9ae288 error:0 in
> libc-2.12.so[7f2ee86a2000+18a000]
> cib[16687] general protection ip:7f41c3789545 sp:7fff9f005848 error:0 in
> libc-2.12.so[7f41c365e000+18a000]
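
One thing that can be read from those dmesg lines: subtracting the
library's load address (the number before the "+") from the ip value gives
the fault offset inside libc, which is comparable across processes even
though each one maps libc at a different address. Below is a minimal
sketch of that arithmetic in Python. The fault_offsets() helper and the
regular expression are my own illustration, not a Pacemaker or kernel
tool, and they assume the line layout shown above. For the lines quoted
here the offset comes out as 0x12b545 each time, which would be consistent
with every cib process dying at the same spot in libc, though it does not
say why.

```python
import re

# Hypothetical helper, written for this thread: matches kernel lines like
#   cib[16656] general protection ip:... sp:... error:0 in libc-2.12.so[base+size]
# and tolerates the mail client's line wrap (and "> " quoting) after "in".
FAULT_RE = re.compile(
    r"(?P<proc>\w+)\[(?P<pid>\d+)\] general protection "
    r"ip:(?P<ip>[0-9a-f]+) sp:[0-9a-f]+ error:\d+ in[\s>]+"
    r"(?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def fault_offsets(dmesg_text):
    """Return a list of (pid, library, offset) tuples, one per fault line."""
    results = []
    for match in FAULT_RE.finditer(dmesg_text):
        # Offset of the faulting instruction relative to the library mapping.
        offset = int(match.group("ip"), 16) - int(match.group("base"), 16)
        results.append((int(match.group("pid")), match.group("lib"), offset))
    return results


# Two of the lines from the dmesg output in this thread, wrapped as mailed.
sample = """\
cib[16656] general protection ip:7f45391e9545 sp:7ffddf16c8b8 error:0 in
libc-2.12.so[7f45390be000+18a000]
cib[16659] general protection ip:7fa36fa89545 sp:7ffe28416288 error:0 in
libc-2.12.so[7fa36f95e000+18a000]
"""

for pid, lib, offset in fault_offsets(sample):
    print(pid, lib, hex(offset))
```

If debug symbols for that exact libc build are installed, the offset could
then be mapped to a function name with the usual tools, which might help
narrow down what the plugin code is passing in.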
>
>
>
> On Mon, Aug 10, 2015 at 9:54 AM, Ken Gaillot <kgaillot at redhat.com> wrote:
>
>> On 08/09/2015 02:27 PM, David Neudorfer wrote:
>>> Where can I dig deeper to figure out why cib keeps terminating? selinux
>>> and iptables are both disabled and I have debug enabled. Google hasn't
>>> been able to help me thus far.
>>>
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: debug:
>>> get_local_nodeid: Local nodeid is 84939948
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> plugin_get_details: Server details: id=84939948 uname=ip-172-20-16-5
>>> cname=pcmk
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_get_peer: Created entry
>>> c1f204b2-c994-48d9-81b6-87e1a7fc1ee7/0xa2c460 for node
>>> ip-172-20-16-5/84939948 (1 total)
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_get_peer: Node 84939948 is now known as ip-172-20-16-5
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_get_peer: Node 84939948 has uuid ip-172-20-16-5
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_update_peer_proc: init_cs_connection_classic: Node
>>> ip-172-20-16-5[84939948] - unknown is now online
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> init_cs_connection_once: Connection to 'classic openais (with
>>> plugin)': established
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> get_node_name: Defaulting to uname -n for the local classic openais
>>> (with plugin) node name
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> qb_ipcs_us_publish: server name: cib_ro
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> qb_ipcs_us_publish: server name: cib_rw
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> qb_ipcs_us_publish: server name: cib_shm
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info: cib_init:
>>> Starting cib mainloop
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> plugin_handle_membership: Membership 104: quorum acquired
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_update_peer_proc: plugin_handle_membership: Node
>>> ip-172-20-16-5[84939948] - unknown is now member
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> crm_update_peer_state: cib_peer_update_callback: Node
>>> ip-172-20-16-5[84939948] - state is now lost (was (null))
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> crm_reap_dead_member: Removing ip-172-20-16-5/84939948 from the
>>> membership list
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> reap_crm_member: Purged 1 peers with id=84939948 and/or uname=(null)
>>> from the membership cache
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: notice:
>>> crm_update_peer_state: plugin_handle_membership: Node ��[2077843320]
>>> - state is now member (was member)
>>> Aug 09 18:54:29 [12526] ip-172-20-16-5 cib: info:
>>> crm_update_peer: plugin_handle_membership: Node ��: id=2077843320
>>> state=r(0) ip(172.20.16.5) addr=r(0) ip(172.20.16.5) (new) votes=1
>>> (new) born=104 seen=104 proc=00000000000000000000000000111312
>>
>> The unprintable characters strongly imply memory corruption. There are
>> known issues with that when using the legacy plugin with some versions
>> of pacemaker. What version are you using? If you are compiling yourself,
>> I would recommend using the current upstream master branch (not 1.1.13,
>> which has the issue).
>>
>> An even better solution would be to switch to corosync 2 instead of the
>> plugin, as corosync 2 gets more development and testing these days.
>>
>>>
>>> https://gist.github.com/davidneudorfer/bc97082a9d9dfb12985b