[Pacemaker] problem with pacemaker/corosync on CentOS 6.3

Fri Jul 20 14:49:15 UTC 2012

----- Original Message -----
> From: fatcharly at gmx.de
> To: pacemaker at oss.clusterlabs.org
> Sent: Friday, July 20, 2012 6:08:45 AM
> Subject: [Pacemaker] problem with pacemaker/corosync  on CentOS 6.3
> 
> Hi,
> 
> I'm using a pacemaker+corosync bundle to run a pound-based
> load balancer. After an update to CentOS 6.3 there is a mismatch
> in the node status. Via crm_mon on one node everything looks fine,
> while on the other node everything is offline. Everything was fine
> on CentOS 6.2.
> 
> Node powerpound:
> 
> ============
> Last updated: Fri Jul 20 12:04:29 2012
> Last change: Thu Jul 19 17:58:31 2012 via crm_attribute on pilotpound
> Stack: openais
> Current DC: powerpound - partition with quorum
> Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
> 2 Nodes configured, 2 expected votes
> 7 Resources configured.
> ============
> 
> Online: [ powerpound pilotpound ]
> 
> HA_IP_1 (ocf::heartbeat:IPaddr2):       Started powerpound
> HA_IP_2 (ocf::heartbeat:IPaddr2):       Started powerpound
> HA_IP_3 (ocf::heartbeat:IPaddr2):       Started powerpound
> HA_IP_4 (ocf::heartbeat:IPaddr2):       Started powerpound
> HA_IP_5 (ocf::heartbeat:IPaddr2):       Started powerpound
>  Clone Set: pingclone [ping-gateway]
>      Started: [ pilotpound powerpound ]
> 
> 
> Node pilotpound:
> 
> ============
> Last updated: Fri Jul 20 12:04:32 2012
> Last change: Thu Jul 19 17:58:17 2012 via crm_attribute on pilotpound
> Stack: openais
> Current DC: NONE
> 2 Nodes configured, 2 expected votes
> 7 Resources configured.
> ============
> 
> OFFLINE: [ powerpound pilotpound ]
> 
> 
> 
> 
> 
> from /var/log/messages on pilotpound:
> 
> Jul 20 12:06:12 pilotpound cib[24755]:  warning: cib_peer_callback:
> Discarding cib_apply_diff message (35909) from powerpound: not in
> our membership
> Jul 20 12:06:12 pilotpound cib[24755]:  warning: cib_peer_callback:
> Discarding cib_apply_diff message (35910) from powerpound: not in
> our membership
> 
> 
> 
> How could this have happened, and what can I do to solve this problem?

Pretty sure it had nothing to do with the upgrade. I had this the other day on Ubuntu 12.04 after a reboot of both nodes. I believe a couple of experts called it a "transient" bug. See:
https://bugzilla.redhat.com/show_bug.cgi?id=820821
https://bugzilla.redhat.com/show_bug.cgi?id=5040

> 
> Any suggestions are welcome

I fixed it by stopping (or killing) pacemaker and corosync on the offending node (pilotpound), then clearing out these files on that same node:
rm /var/lib/heartbeat/crm/cib*
rm /var/lib/pengine/*

Then I restarted corosync and pacemaker, and the node rejoined fine.
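
For reference, the whole sequence could be sketched as the script below. This is only a sketch under assumptions, not a tested procedure: it assumes SysV init scripts named pacemaker and corosync (as on CentOS 6), and the names run, reset_node_state and DRY_RUN are made up for illustration. If pacemaker is started by the corosync plugin rather than by its own init script, the two pacemaker lines would not apply. By default it only prints what it would do; run it only on the node that shows everything OFFLINE, never on the healthy node.

```shell
#!/bin/sh
# Sketch: reset the local Pacemaker state on a node that will not rejoin.
# DRY_RUN=1 (the default here) only prints the commands.
# Set DRY_RUN=0 and run as root to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reset_node_state() {
    # Assumption: init scripts are named pacemaker and corosync.
    run service pacemaker stop
    run service corosync stop

    # Paths as given above. -f avoids an error if nothing matches;
    # sh -c lets the shell expand the globs at execution time.
    run sh -c 'rm -f /var/lib/heartbeat/crm/cib*'
    run sh -c 'rm -f /var/lib/pengine/*'

    run service corosync start
    run service pacemaker start
}

reset_node_state
```

Removing the cib files discards the node's local copy of the cluster configuration, so it picks up the current one from the other node when it rejoins. That is why this should only be done on the broken node while the other one is up.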

HTH

Jake