[Pacemaker] Corosync crashes when cluster NIC disabled (Something strange happened)
Simpson, John R
john_simpson at reyrey.com
Wed Mar 31 20:07:10 UTC 2010
Greetings all,
I have a lab cluster using Pacemaker 1.0.8 and Corosync 1.2.0-1
(see packages below) on CentOS 5.4 (32-bit) VMs running under
VMware ESXi 3.5. My location constraints and connectivity
tests were working well, so I was feeling really good when
I decided to shut down the interface used for cluster
communication and verify that it resulted in a split-brain cluster.
Much to my dismay, corosync crashed almost immediately on the node
where I shut down the Ethernet interface. I can recreate the issue
at will on this cluster and a different cluster running a slightly
more recent version of Pacemaker 1.0.8 and the same version of
Corosync on CentOS 5.4 64-bit VMs.
I've attached the log, but here is the most suspicious message:
Mar 31 15:35:16 corosync [pcmk ] ERROR: pcmk_peer_update: Something strange happened: 1
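For anyone triaging a similar report, a small parser that splits corosync log lines like the one above into fields makes it easier to pull the pcmk errors out of the attached log. This is a minimal Python sketch. It assumes the syslog layout seen in that single quoted line, and the names `parse_line`, `pcmk_errors` and `LogEntry` are hypothetical, not part of corosync or Pacemaker.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical pattern inferred from the one line quoted in this post:
#   "Mar 31 15:35:16 corosync [pcmk ] ERROR: pcmk_peer_update: ..."
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+corosync\s+"
    r"\[(?P<subsys>[^\]]*)\]\s+(?P<rest>.*)$"
)
LEVEL_RE = re.compile(r"^(?P<level>[A-Z]+):\s+(?P<rest>.*)$")   # e.g. "ERROR: ..."
FUNC_RE = re.compile(r"^(?P<func>[A-Za-z_]\w*):\s+(?P<rest>.*)$")  # e.g. "pcmk_peer_update: ..."


class LogEntry(NamedTuple):
    month: str
    day: int
    time: str
    subsys: str
    level: Optional[str]
    function: Optional[str]
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one corosync log line into fields; return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rest = m.group("rest")

    # An all-caps "WORD:" prefix is treated as the severity (assumption).
    level = None
    lm = LEVEL_RE.match(rest)
    if lm is not None:
        level, rest = lm.group("level"), lm.group("rest")

    # A following C-identifier-style "name:" is treated as the function name.
    function = None
    fm = FUNC_RE.match(rest)
    if fm is not None:
        function, rest = fm.group("func"), fm.group("rest")

    return LogEntry(m.group("month"), int(m.group("day")), m.group("time"),
                    m.group("subsys").strip(), level, function, rest)


def pcmk_errors(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only parsed entries from the pcmk subsystem logged at ERROR level."""
    entries = (parse_line(line) for line in lines)
    return [e for e in entries if e is not None and e.subsys == "pcmk" and e.level == "ERROR"]
```

Usage would look like `with open("corosync-crash-log.txt") as f: print(pcmk_errors(f))`, assuming the whole attached log follows the same line format as the quoted message.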
Cluster communication is on 172.16.0.0/24 (eth1) and Apache, etc. are on 10.127.252.0/24 (eth0).
I've tried to include or attach all the relevant information -- please let me know if there's anything else that would be useful.
Regards,
John Simpson
[root@cy-ha01 ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.255.255.0   U         0 0          0 eth3
172.16.0.0      0.0.0.0         255.255.255.0   U         0 0          0 eth1
192.168.0.0     0.0.0.0         255.255.255.0   U         0 0          0 eth2
10.127.252.0    0.0.0.0         255.255.255.0   U         0 0          0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth3
224.0.0.0       0.0.0.0         240.0.0.0       U         0 0          0 eth1
0.0.0.0         10.127.252.1    0.0.0.0         UG        0 0          0 eth0
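The routing table shows that both the cluster subnet (172.16.0.0/24) and the whole multicast range (224.0.0.0/4) leave via eth1. The Python sketch below does a longest-prefix match over those rows to show which interface a packet would use, and what changes if eth1's routes are treated as withdrawn once the interface is down. Dropping a downed interface's routes is a simplifying assumption about kernel behavior, and the peer address 172.16.0.2 and multicast group 226.94.1.1 are hypothetical examples; the real addresses are not shown in this post.

```python
import ipaddress
from typing import Collection, List, Optional, Tuple

# (destination, genmask, iface) rows copied from the `netstat -rn` output above.
ROUTES: List[Tuple[str, str, str]] = [
    ("10.0.0.0", "255.255.255.0", "eth3"),
    ("172.16.0.0", "255.255.255.0", "eth1"),
    ("192.168.0.0", "255.255.255.0", "eth2"),
    ("10.127.252.0", "255.255.255.0", "eth0"),
    ("169.254.0.0", "255.255.0.0", "eth3"),
    ("224.0.0.0", "240.0.0.0", "eth1"),
    ("0.0.0.0", "0.0.0.0", "eth0"),  # default route via gateway 10.127.252.1
]


def egress_iface(dest: str, down: Collection[str] = ()) -> Optional[str]:
    """Longest-prefix match of dest against ROUTES, ignoring routes on interfaces in `down`."""
    addr = ipaddress.IPv4Address(dest)
    best_iface: Optional[str] = None
    best_len = -1
    for net, mask, iface in ROUTES:
        if iface in down:
            continue  # simplification: a downed interface contributes no routes
        network = ipaddress.IPv4Network(f"{net}/{mask}")  # accepts dotted netmask form
        if addr in network and network.prefixlen > best_len:
            best_iface, best_len = iface, network.prefixlen
    return best_iface


if __name__ == "__main__":
    # Hypothetical destinations: a peer on the cluster subnet and a multicast group.
    for dest in ("172.16.0.2", "226.94.1.1"):
        print(dest, egress_iface(dest), "->", egress_iface(dest, down={"eth1"}))
```

In this model both hypothetical destinations resolve to eth1 while it is up, and fall through to the default route on eth0 once its routes are removed.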
[root@cy-ha01 ~]# date ; ifconfig eth1 down
Wed Mar 31 15:35:03 EDT 2010
Output from crm_mon when eth1 is shut down.
============
Last updated: Wed Mar 31 15:31:50 2010
Stack: openais
Current DC: cy-ha02 - partition with quorum
Version: 1.0.8-2a76c6ac04bcccf42b89a08e55bfbd90da2fb49a
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ cy-ha01 cy-ha02 ]

 Resource Group: WebSiteGroup
     ServiceIP  (ocf::heartbeat:IPaddr2):  Started cy-ha01
     WebSite    (ocf::heartbeat:apache):   Started cy-ha01
 Clone Set: CloneConnectivityTest
     Started: [ cy-ha02 cy-ha01 ]
Connection to the CIB terminated
Reconnecting................................
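Comparing the `date` output printed just before `ifconfig eth1 down` (15:35:03 EDT) with the timestamp on the pcmk error (15:35:16) puts a number on "almost immediately". A minimal Python check, assuming the syslog timestamp is in the same local time zone as `date` and taking the year 2010 from the `date` output, since the log line itself carries no year:

```python
from datetime import datetime

# Both timestamps are copied from this post. Assumption: the syslog time on the
# pcmk error uses the same local time zone (EDT) as the `date` command output.
FMT = "%Y %b %d %H:%M:%S"
eth1_down = datetime.strptime("2010 Mar 31 15:35:03", FMT)   # `date` run just before ifconfig
pcmk_error = datetime.strptime("2010 Mar 31 15:35:16", FMT)  # "Something strange happened: 1"

elapsed = (pcmk_error - eth1_down).total_seconds()
print(f"pcmk error logged {elapsed:.0f} s after eth1 went down")
# prints "pcmk error logged 13 s after eth1 went down"
```

Because `date` has one-second resolution and runs just before `ifconfig`, the gap is about 13 seconds, give or take a second.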
[root@cy-ha01 ~]# rpm -qa | grep pace
pacemaker-libs-devel-1.0.8-1.el5
pacemaker-1.0.8-1.el5
pacemaker-libs-1.0.8-1.el5
[root@cy-ha01 ~]# rpm -qa | grep coros
corosynclib-1.2.0-1.el5
corosync-1.2.0-1.el5
corosynclib-devel-1.2.0-1.el5
--
John Simpson
Senior Software Engineer, I. T. Engineering and Operations
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync-crash.xml
Type: text/xml
Size: 4729 bytes
Desc: corosync-crash.xml
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0003.xml>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosync-crash-log.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0006.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosync-conf.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0007.txt>