[Pacemaker] Centos 6.2 corosync errors after reboot prevent joining

Tue Jul 3 07:42:21 UTC 2012

Hi,

On Mon, Jul 2, 2012 at 7:47 PM, Martin de Koning <martindk80 at gmail.com> wrote:
> Hi all,
>
> Reasonably new to pacemaker and having some issues with corosync loading the
> pacemaker plugin after a reboot of the node. It looks like similar issues
> have been posted before but I haven't found a relevant fix.
>
> The CentOS 6.2 node was online before the reboot and restarting the corosync
> and pacemaker services caused no issues. Since the reboot and subsequent
> reboots, I am unable to get pacemaker to join the cluster.
>
> After the reboot corosync now reports the following:
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.cib failed: ipc delivery failed
> (rc=-2)
> Jul  2 17:56:22 sessredis-03 corosync[1644]:   [pcmk  ] WARN:
> route_ais_message: Sending message to local.crmd failed: ipc delivery failed
> (rc=-2)
>
> The full syslog is here:
> http://pastebin.com/raw.php?i=f9eBuqUh
>
> corosync-1.4.1-4.el6_2.3.x86_64
> pacemaker-1.1.6-3.el6.x86_64
>
> I have checked the obvious such as inter-cluster communication and
> firewall rules. It appears to me that there may be an issue with the
> Pacemaker cluster information base and not corosync. Any ideas? Can I clear
> the CIB manually somehow to resolve this?

What does "corosync-objctl | grep member" return? Can you see the same
multicast groups on all of the nodes when you run "netstat -ng"?
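
To make the comparison between nodes easier, the two checks above could be
wrapped roughly like this. It is only a sketch: the function name
collect_membership and the default output path are made up for illustration,
and only the two commands themselves come from this mail.

```shell
#!/bin/sh
# Sketch: save the membership and multicast checks to one file per node,
# so the files from all nodes can be compared side by side.
# collect_membership and the default path are hypothetical names.
collect_membership() {
    out="${1:-/tmp/cluster-check-$(hostname).txt}"
    {
        echo "== corosync members =="
        corosync-objctl | grep member
        echo "== multicast groups =="
        netstat -ng
    } > "$out" 2>&1    # errors (e.g. a missing tool) land in the file too
    echo "$out"        # print where the report was written
}
```

Run collect_membership on every node, copy the resulting files to one
machine and diff them. The member lists and multicast groups should agree.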

To clear the CIB manually do a "rm -rfi /var/lib/heartbeat/crm/*" on
the faulty node (with corosync and pacemaker stopped), then start
corosync and pacemaker.
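
As a rough sketch of that sequence, assuming both services are managed with
the "service" command as on a stock CentOS 6 install. Note that this variant
moves the CIB files to a backup directory instead of deleting them, which is
a deliberate change from the rm command above. BACKUP_DIR and the function
name clear_cib are hypothetical.

```shell
#!/bin/sh
# Sketch: clear the local CIB on the faulty node, keeping a backup copy.
# CIB_DIR is the path from this mail; BACKUP_DIR is a hypothetical location.
CIB_DIR="${CIB_DIR:-/var/lib/heartbeat/crm}"
BACKUP_DIR="${BACKUP_DIR:-/root/cib-backup}"

clear_cib() {
    # Refuse to act if the CIB directory is missing.
    [ -d "$CIB_DIR" ] || { echo "no such directory: $CIB_DIR" >&2; return 1; }
    mkdir -p "$BACKUP_DIR" || return 1
    # Move rather than delete, so the old CIB files can be inspected later.
    for f in "$CIB_DIR"/*; do
        [ -e "$f" ] || continue      # directory was already empty
        mv "$f" "$BACKUP_DIR"/ || return 1
    done
}

# Intended order on the faulty node (as root). Left commented out so that
# sourcing this file has no side effects:
#   service pacemaker stop
#   service corosync stop
#   clear_cib
#   service corosync start
#   service pacemaker start
```

Once the node rejoins, it should receive the current CIB from the other
cluster members, so the backup is only there for later inspection.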

HTH,
Dan

>
> Cheers
> Martin

-- 
Dan Frincu
CCNA, RHCE