[Pacemaker] Centos 6.2 corosync errors after reboot prevent joining
Martin de Koning
martindk80 at gmail.com
Tue Jul 3 08:51:15 UTC 2012
On Tue, Jul 3, 2012 at 9:42 AM, Dan Frincu <df.cluster at gmail.com> wrote:
>
> Hi,
>
> On Mon, Jul 2, 2012 at 7:47 PM, Martin de Koning <martindk80 at gmail.com> wrote:
> > Hi all,
> >
> > Reasonably new to pacemaker and having some issues with corosync loading the
> > pacemaker plugin after a reboot of the node. It looks like similar issues
> > have been posted before but I haven't found a relevant fix.
> >
> > The Centos 6.2 node was online before the reboot and restarting the corosync
> > and pacemaker services caused no issues. Since the reboot and subsequent
> > reboots, I am unable to get pacemaker to join the cluster.
> >
> > After the reboot corosync now reports the following:
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.cib failed: ipc delivery failed
> > (rc=-2)
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.cib failed: ipc delivery failed
> > (rc=-2)
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.cib failed: ipc delivery failed
> > (rc=-2)
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.cib failed: ipc delivery failed
> > (rc=-2)
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.cib failed: ipc delivery failed
> > (rc=-2)
> > Jul 2 17:56:22 sessredis-03 corosync[1644]: [pcmk ] WARN:
> > route_ais_message: Sending message to local.crmd failed: ipc delivery failed
> > (rc=-2)
> >
> > The full syslog is here:
> > http://pastebin.com/raw.php?i=f9eBuqUh
> >
> > corosync-1.4.1-4.el6_2.3.x86_64
> > pacemaker-1.1.6-3.el6.x86_64
> >
> > I have checked the obvious such as inter-cluster communication and
> > firewall rules. It appears to me that there may be an issue with the
> > Pacemaker cluster information base and not corosync. Any ideas? Can I clear
> > the CIB manually somehow to resolve this?
>
> What does "corosync-objctl | grep member" return? Can you see the same
> multicast groups on all of the nodes when you run "netstat -ng"?
>
> To clear the CIB manually do a "rm -rfi /var/lib/heartbeat/crm/*" on
> the faulty node (with corosync and pacemaker stopped), then start
> corosync and pacemaker.
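>
> For example, something along these lines should do it (a rough sketch,
> assuming the stock EL6 init scripts; if pacemaker is started by the
> corosync plugin itself with ver: 0, skip the pacemaker service lines):
>
> # on the faulty node only
> service pacemaker stop
> service corosync stop
> rm -rfi /var/lib/heartbeat/crm/*   # removes the node's local copy of the CIB
> service corosync start
> service pacemaker start
>
> When the node rejoins, it should pull a fresh copy of the CIB from the
> current DC.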
>
> HTH,
> Dan
>
Hi Dan,
Thanks for your response.
[root at sessredis-03 ~]# corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.520202432.ip=r(0) ip(192.168.1.31)
runtime.totem.pg.mrp.srp.members.520202432.join_count=1
runtime.totem.pg.mrp.srp.members.520202432.status=joined
runtime.totem.pg.mrp.srp.members.268544192.ip=r(0) ip(192.168.1.16)
runtime.totem.pg.mrp.srp.members.268544192.join_count=1
runtime.totem.pg.mrp.srp.members.268544192.status=joined
192.168.1.31 (sessredis-03) is the failed node.
[root at sessredis-03 ~]# netstat -ng
IPv6/IPv4 Group Memberships
Interface RefCnt Group
--------------- ------ ---------------------
lo 1 224.0.0.1
eth0 1 224.0.0.1
eth1 1 226.94.1.2
eth1 1 224.0.0.1
Looking at the logs on the healthy node, I can see the failed member
successfully joining the ring.
Corosync logs on healthy node:
http://pastebin.com/raw.php?i=dve1jFbD
I have since taken a closer look at the healthy node and I see that
the CIB is empty:
[root at sessredis-03 ~]# crm configure show
node sessredis-01.localdomain
node sessredis-03.localdomain
property $id="cib-bootstrap-options" \
dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
cluster-infrastructure="openais" \
expected-quorum-votes="2"
[root at sessredis-01 ~]# crm status
============
Last updated: Tue Jul 3 10:09:38 2012
Last change: Mon Jul 2 17:51:55 2012 via crmd on sessredis-01.localdomain
Stack: openais
Current DC: sessredis-01.localdomain - partition with quorum
Version: 1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558
2 Nodes configured, 2 expected votes
0 Resources configured.
============
Node sessredis-03.localdomain: pending
Online: [ sessredis-01.localdomain ]
redis-3:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
redis-0:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
redis-4:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
sessredis-vip (ocf::heartbeat:IPaddr2): ORPHANED Started sessredis-01.localdomain
redis-1:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
redis-2:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
redis-5:0 (ocf::booking.com:redis): ORPHANED Master sessredis-01.localdomain
I've looked at the logs and don't see any obvious errors.
Restarting pacemaker and corosync on the healthy node has not resolved
the empty CIB issue, but it has resolved my original problem with
corosync being unable to start the pacemaker plugin on the rebooted
node. So after restarting pacemaker and corosync on both nodes, both
nodes now show as online, but my original cluster configuration is
gone.
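One thought: pacemaker keeps numbered backups of the CIB under
/var/lib/heartbeat/crm (cib-*.raw alongside cib.xml), so I'm wondering
whether something like the following would bring the old configuration
back (the backup number below is just a placeholder, and I haven't
tried this yet):

# on the DC, with the cluster up
ls -lt /var/lib/heartbeat/crm/cib-*.raw      # pick a backup from before the loss
cibadmin --replace --xml-file /var/lib/heartbeat/crm/cib-<N>.raw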
Any ideas?
Cheers
Martin
>
>
> --
> Dan Frincu
> CCNA, RHCE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org