[Pacemaker] Corosync crashes when cluster NIC disabled (Something strange happened)

Simpson, John R john_simpson at reyrey.com
Wed Mar 31 20:07:10 UTC 2010


Greetings all,

I have a lab cluster using Pacemaker 1.0.8 and Corosync 1.2.0-1
(see packages below) on CentOS 5.4 (32-bit) VMs running under
VMware ESXi 3.5.  My location constraints and connectivity
tests were working well, so I was feeling really good when 
I decided to shut down the interface used for cluster 
communication and verify that it resulted in a split-brain cluster.

Much to my dismay, corosync crashed almost immediately on the node
where I shut down the Ethernet interface.  I can recreate the issue
at will on this cluster and on a different cluster running a slightly
more recent build of Pacemaker 1.0.8 and the same version of
Corosync on CentOS 5.4 64-bit VMs.

I've attached the log, but here is the most suspicious message:

Mar 31 15:35:16 corosync [pcmk  ] ERROR: pcmk_peer_update: Something strange happened: 1

Cluster communication is on 172.16.0.0/24 (eth1); Apache and the other
services are on 10.127.252.0/24 (eth0).

I've tried to include or attach all the relevant information -- please let me know if there's anything else that would be useful.

Regards,

John Simpson

[root@cy-ha01 ~]# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
10.0.0.0        0.0.0.0         255.255.255.0   U         0 0     0 eth3
172.16.0.0      0.0.0.0         255.255.255.0   U         0 0     0 eth1
192.168.0.0     0.0.0.0         255.255.255.0   U         0 0     0 eth2
10.127.252.0    0.0.0.0         255.255.255.0   U         0 0     0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0     0 eth3
224.0.0.0       0.0.0.0         240.0.0.0       U         0 0     0 eth1
0.0.0.0         10.127.252.1    0.0.0.0         UG        0 0     0 eth0
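
As a sanity check that both the cluster subnet and multicast traffic leave through eth1, the routing table above can be evaluated with a small script. This is a minimal sketch in Python, not part of the original setup: the helper names (parse_routes, iface_for) and the probe addresses are hypothetical, chosen only for illustration. The cluster's actual multicast address is in the attached corosync configuration and is not assumed here.

```python
import ipaddress

# Routing table rows as printed by `netstat -rn` on cy-ha01 (copied from this message).
ROUTES = """\
10.0.0.0        0.0.0.0         255.255.255.0   U         0 0     0 eth3
172.16.0.0      0.0.0.0         255.255.255.0   U         0 0     0 eth1
192.168.0.0     0.0.0.0         255.255.255.0   U         0 0     0 eth2
10.127.252.0    0.0.0.0         255.255.255.0   U         0 0     0 eth0
169.254.0.0     0.0.0.0         255.255.0.0     U         0 0     0 eth3
224.0.0.0       0.0.0.0         240.0.0.0       U         0 0     0 eth1
0.0.0.0         10.127.252.1    0.0.0.0         UG        0 0     0 eth0
"""


def parse_routes(text):
    """Return a list of (network, interface) pairs from netstat -rn rows."""
    routes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 8:
            continue  # skip headers, blank lines, anything that is not a route row
        dest, _gateway, mask, _flags, _mss, _window, _irtt, iface = fields
        routes.append((ipaddress.ip_network(f"{dest}/{mask}"), iface))
    return routes


def iface_for(address, routes):
    """Pick the interface of the most specific (longest-prefix) matching route."""
    addr = ipaddress.ip_address(address)
    matches = [(net.prefixlen, iface) for net, iface in routes if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0])[1]


if __name__ == "__main__":
    routes = parse_routes(ROUTES)
    # Hypothetical probe addresses: a host on the cluster subnet, a multicast
    # group, and a host on the service subnet.
    for probe in ("172.16.0.10", "224.0.0.1", "10.127.252.5"):
        print(probe, "->", iface_for(probe, routes))
```

With this table, addresses in 172.16.0.0/24 and the whole 224.0.0.0/4 multicast range resolve to eth1, so taking eth1 down removes both the unicast and the multicast path for cluster traffic.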

[root@cy-ha01 ~]# date ; ifconfig eth1 down
Wed Mar 31 15:35:03 EDT 2010

Output from crm_mon when eth1 is shut down:
============
Last updated: Wed Mar 31 15:31:50 2010
Stack: openais
Current DC: cy-ha02 - partition with quorum
Version: 1.0.8-2a76c6ac04bcccf42b89a08e55bfbd90da2fb49a
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ cy-ha01 cy-ha02 ]

 Resource Group: WebSiteGroup
     ServiceIP  (ocf::heartbeat:IPaddr2):       Started cy-ha01
     WebSite    (ocf::heartbeat:apache):        Started cy-ha01
 Clone Set: CloneConnectivityTest
     Started: [ cy-ha02 cy-ha01 ]
Connection to the CIB terminated
Reconnecting................................

[root@cy-ha01 ~]# rpm -qa | grep pace
pacemaker-libs-devel-1.0.8-1.el5
pacemaker-1.0.8-1.el5
pacemaker-libs-1.0.8-1.el5
[root@cy-ha01 ~]# rpm -qa | grep coros
corosynclib-1.2.0-1.el5
corosync-1.2.0-1.el5
corosynclib-devel-1.2.0-1.el5

--
John Simpson 
Senior Software Engineer, I. T. Engineering and Operations

-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync-crash.xml
Type: text/xml
Size: 4729 bytes
Desc: corosync-crash.xml
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0003.xml>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosync-crash-log.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0006.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: corosync-conf.txt
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100331/ae9be085/attachment-0007.txt>

