[Pacemaker] Corosync crashes when cluster NIC disabled (Something strange happened)
Steven Dake
sdake at redhat.com
Wed Mar 31 20:34:18 UTC 2010
On Wed, 2010-03-31 at 16:07 -0400, Simpson, John R wrote:
> Greetings all,
>
> I have a lab cluster using Pacemaker 1.0.8 and Corosync 1.2.0-1
> (see packages below) on CentOS 5.4 (32-bit) VMs running under
> VMware ESXi 3.5. My location constraints and connectivity
> tests were working well, so I was feeling really good when
> I decided to shut down the interface used for cluster
> communication and verify that it resulted in a split-brain cluster.
>
> Much to my dismay, corosync crashed almost immediately on the node
> where I shut down the Ethernet interface. I can recreate the issue
> at will on this cluster and a different cluster running a slightly
> more recent version of Pacemaker 1.0.8 and the same version of
> Corosync on CentOS 5.4 64-bit VMs.
>
> I've attached the log, but here is the most suspicious message:
>
> Mar 31 15:35:16 corosync [pcmk ] ERROR: pcmk_peer_update: Something strange happened: 1
>
> Cluster communication is on 172.16.0.0/24 (eth1) and Apache, etc. are on 10.127.252.0/24 (eth0).
>
> I've tried to include or attach all the relevant information -- please let me know if there's anything else that would be useful.
>
> Regards,
>
> John Simpson
>
I've answered this so many times on the mailing list that I've created a FAQ for it.
If the FAQ is unclear, let me know, and we can add to it.
http://www.corosync.org/doku.php?id=faq:ifdown
You mentioned that Corosync crashed (segfault?), which it should not do.
To report that crash, see the following FAQ:
http://www.corosync.org/doku.php?id=faq:crash
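
One common way to simulate a lost cluster link without taking the interface itself down is to drop its traffic with iptables. The sketch below is an illustration only and is not taken from the FAQ. The helper names (run, block_cluster_traffic, unblock_cluster_traffic) and the IFACE and DRY_RUN variables are hypothetical; eth1 is the cluster interface named in this thread. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical helper: drop all traffic on the cluster interface instead of
# running "ifconfig eth1 down". The interface keeps its address; only the
# packets are discarded.

IFACE="${IFACE:-eth1}"     # assumption: cluster NIC from this thread
DRY_RUN="${DRY_RUN:-1}"    # 1 = print commands only, 0 = execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Insert DROP rules for everything entering or leaving the interface.
block_cluster_traffic() {
    run iptables -I INPUT -i "$IFACE" -j DROP
    run iptables -I OUTPUT -o "$IFACE" -j DROP
}

# Delete the same rules again to restore communication.
unblock_cluster_traffic() {
    run iptables -D INPUT -i "$IFACE" -j DROP
    run iptables -D OUTPUT -o "$IFACE" -j DROP
}
```

After sourcing the file, call block_cluster_traffic on one node, watch crm_mon on both, then call unblock_cluster_traffic. Set DRY_RUN=0 to apply the rules for real.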
> [root@cy-ha01 ~]# netstat -rn
> Kernel IP routing table
> Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
> 10.0.0.0        0.0.0.0         255.255.255.0   U         0 0          0 eth3
> 172.16.0.0      0.0.0.0         255.255.255.0   U         0 0          0 eth1
> 192.168.0.0     0.0.0.0         255.255.255.0   U         0 0          0 eth2
> 10.127.252.0    0.0.0.0         255.255.255.0   U         0 0          0 eth0
> 169.254.0.0     0.0.0.0         255.255.0.0     U         0 0          0 eth3
> 224.0.0.0       0.0.0.0         240.0.0.0       U         0 0          0 eth1
> 0.0.0.0         10.127.252.1    0.0.0.0         UG        0 0          0 eth0
>
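
In the table above, the multicast route (224.0.0.0) points at eth1, which matches the interface described as carrying cluster communication. A small sketch for reading that off the routing table automatically follows. The function name multicast_iface is hypothetical, and it assumes the "netstat -rn" column layout shown in this message, with the interface name in the last column.

```shell
#!/bin/sh
# Hypothetical helper: read "netstat -rn" output on stdin and print the
# interface (last column) of the route whose destination is 224.0.0.0.
multicast_iface() {
    awk '$1 == "224.0.0.0" { print $NF }'
}

# Usage (assumes netstat is installed):
#   netstat -rn | multicast_iface
```

If the printed interface is not the one you expect cluster traffic to use, the multicast route is the first thing to check.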
> [root@cy-ha01 ~]# date ; ifconfig eth1 down
> Wed Mar 31 15:35:03 EDT 2010
>
> Output from crm_mon when eth1 is shut down.
> ============
> Last updated: Wed Mar 31 15:31:50 2010
> Stack: openais
> Current DC: cy-ha02 - partition with quorum
> Version: 1.0.8-2a76c6ac04bcccf42b89a08e55bfbd90da2fb49a
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ cy-ha01 cy-ha02 ]
>
> Resource Group: WebSiteGroup
>     ServiceIP (ocf::heartbeat:IPaddr2): Started cy-ha01
>     WebSite (ocf::heartbeat:apache): Started cy-ha01
> Clone Set: CloneConnectivityTest
>     Started: [ cy-ha02 cy-ha01 ]
> Connection to the CIB terminated
> Reconnecting................................
>
> [root@cy-ha01 ~]# rpm -qa | grep pace
> pacemaker-libs-devel-1.0.8-1.el5
> pacemaker-1.0.8-1.el5
> pacemaker-libs-1.0.8-1.el5
> [root@cy-ha01 ~]# rpm -qa | grep coros
> corosynclib-1.2.0-1.el5
> corosync-1.2.0-1.el5
> corosynclib-devel-1.2.0-1.el5
>
> --
> John Simpson
> Senior Software Engineer, I. T. Engineering and Operations
>