[Pacemaker] problem with pacemaker/corosync on CentOS 6.3
Andrew Beekhof
andrew at beekhof.net
Sun Jul 29 23:41:27 UTC 2012
On Tue, Jul 24, 2012 at 11:13 PM, <fatcharly at gmx.de> wrote:
> Hi,
>
> here are the results of the corosync status. Can't find a problem there:
>
> pilotpound:
>
> [root@pilotpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 425699520
> RING ID 0
> id = 192.168.95.25
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.20.245
> status = ring 1 active with no faults
> [root@pilotpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=1
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
>
>
> powerpound:
>
> [root@powerpound ~]# corosync-cfgtool -s
> Printing ring status.
> Local node ID 442476736
> RING ID 0
> id = 192.168.95.26
> status = ring 0 active with no faults
> RING ID 1
> id = 192.168.20.246
> status = ring 1 active with no faults
> [root@powerpound ~]# corosync-objctl | grep member
> runtime.totem.pg.mrp.srp.members.442476736.ip=r(0) ip(192.168.95.26) r(1) ip(192.168.20.246)
> runtime.totem.pg.mrp.srp.members.442476736.join_count=1
> runtime.totem.pg.mrp.srp.members.442476736.status=joined
> runtime.totem.pg.mrp.srp.members.425699520.ip=r(0) ip(192.168.95.25) r(1) ip(192.168.20.245)
> runtime.totem.pg.mrp.srp.members.425699520.join_count=5
> runtime.totem.pg.mrp.srp.members.425699520.status=joined
That is almost certainly the two bugs Jake pointed out - note that
powerpound reports join_count=5 for pilotpound, while pilotpound counts
only a single join for itself.
The good news is that upstream got to the bottom of the problem and it
is now fixed.
>
> So I think I've got to swallow the bitter pill and restart the whole cluster.
>
> I will report about the result.
>
> Kind regards
>
> fatcharly
>
>
> -------- Original Message --------
>> Date: Fri, 20 Jul 2012 12:21:47 -0400 (EDT)
>> From: Jake Smith <jsmith at argotec.com>
>> To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
>> Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>
>>
>> ----- Original Message -----
>> > From: fatcharly at gmx.de
>> > To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster resource
>> manager" <pacemaker at oss.clusterlabs.org>
>> > Sent: Friday, July 20, 2012 11:50:52 AM
>> > Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS 6.3
>> >
>> > Hi Jake,
>> >
>> > I erased the files as mentioned and started the services. This is
>> > what I get on pilotpound from crm_mon:
>> >
>> > ============
>> > Last updated: Fri Jul 20 17:45:58 2012
>> > Last change:
>> > Current DC: NONE
>> > 0 Nodes configured, unknown expected votes
>> > 0 Resources configured.
>> > ============
>> >
>> >
>> > Looks like the system didn't join the cluster.
>> >
>> > Any suggestions are welcome
>>
>> Maybe worth checking the corosync membership to see what it says now:
>> http://www.hastexo.com/resources/hints-and-kinks/checking-corosync-cluster-membership
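>>
>> Something like this on each node (a sketch using the corosync 1.x
>> tools shipped with CentOS 6; both rings should report no faults and
>> every node should show status=joined):
>>
>> corosync-cfgtool -s               # ring status as seen by the local node
>> corosync-objctl | grep member     # runtime membership list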
>>
>> >
>> > Kind regards
>> >
>> > fatcharly
>> >
>> > -------- Original Message --------
>> > > Date: Fri, 20 Jul 2012 10:49:15 -0400 (EDT)
>> > > From: Jake Smith <jsmith at argotec.com>
>> > > To: The Pacemaker cluster resource manager
>> > > <pacemaker at oss.clusterlabs.org>
>> > > Subject: Re: [Pacemaker] problem with pacemaker/corosync on CentOS
>> > > 6.3
>> >
>> > >
>> > > ----- Original Message -----
>> > > > From: fatcharly at gmx.de
>> > > > To: pacemaker at oss.clusterlabs.org
>> > > > Sent: Friday, July 20, 2012 6:08:45 AM
>> > > > Subject: [Pacemaker] problem with pacemaker/corosync on CentOS
>> > > > 6.3
>> > > >
>> > > > Hi,
>> > > >
>> > > > I'm using a pacemaker+corosync bundle to run a pound-based
>> > > > load balancer. After an update to CentOS 6.3 the node status
>> > > > no longer matches between the nodes: in crm_mon everything
>> > > > looks fine on one node, while on the other everything is
>> > > > offline. Everything was fine on CentOS 6.2.
>> > > >
>> > > > Node powerpound:
>> > > >
>> > > > ============
>> > > > Last updated: Fri Jul 20 12:04:29 2012
>> > > > Last change: Thu Jul 19 17:58:31 2012 via crm_attribute on
>> > > > pilotpound
>> > > > Stack: openais
>> > > > Current DC: powerpound - partition with quorum
>> > > > Version: 1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14
>> > > > 2 Nodes configured, 2 expected votes
>> > > > 7 Resources configured.
>> > > > ============
>> > > >
>> > > > Online: [ powerpound pilotpound ]
>> > > >
>> > > > HA_IP_1 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_2 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_3 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_4 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > HA_IP_5 (ocf::heartbeat:IPaddr2): Started powerpound
>> > > > Clone Set: pingclone [ping-gateway]
>> > > > Started: [ pilotpound powerpound ]
>> > > >
>> > > >
>> > > > Node pilotpound:
>> > > >
>> > > > ============
>> > > > Last updated: Fri Jul 20 12:04:32 2012
>> > > > Last change: Thu Jul 19 17:58:17 2012 via crm_attribute on
>> > > > pilotpound
>> > > > Stack: openais
>> > > > Current DC: NONE
>> > > > 2 Nodes configured, 2 expected votes
>> > > > 7 Resources configured.
>> > > > ============
>> > > >
>> > > > OFFLINE: [ powerpound pilotpound ]
>> > > >
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > from /var/log/messages on pilotpound:
>> > > >
>> > > > Jul 20 12:06:12 pilotpound cib[24755]: warning: cib_peer_callback: Discarding cib_apply_diff message (35909) from powerpound: not in our membership
>> > > > Jul 20 12:06:12 pilotpound cib[24755]: warning: cib_peer_callback: Discarding cib_apply_diff message (35910) from powerpound: not in our membership
>> > > >
>> > > >
>> > > >
>> > > > How could this happen, and what can I do to solve this problem?
>> > >
>> > > Pretty sure it had nothing to do with the upgrade - I had this the
>> > > other day on Ubuntu 12.04 after a reboot of both nodes. I believe a
>> > > couple of experts called it a "transient" bug. See:
>> > > https://bugzilla.redhat.com/show_bug.cgi?id=820821
>> > > http://bugs.clusterlabs.org/show_bug.cgi?id=5040
>> > >
>> > > >
>> > > > Any suggestions are welcome
>> > >
>> > > I fixed it by stopping/killing pacemaker/corosync on the offending
>> > > node (pilotpound), then cleared these files out on the same node:
>> > > rm /var/lib/heartbeat/crm/cib*
>> > > rm /var/lib/pengine/*
>> > >
>> > > Then restarted corosync/pacemaker and the node rejoined fine.
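>> > >
>> > > For reference, the whole sequence on a CentOS 6 box would look
>> > > roughly like this (a sketch assuming the stock init scripts and the
>> > > default CIB/pengine locations; with the corosync-plugin setup,
>> > > stopping and starting corosync alone also takes pacemaker with it):
>> > >
>> > > # stop the cluster stack on the offending node
>> > > service pacemaker stop
>> > > service corosync stop
>> > > # throw away the node's cached CIB and policy engine files;
>> > > # they get resynced from the peer when the node rejoins
>> > > rm -f /var/lib/heartbeat/crm/cib*
>> > > rm -f /var/lib/pengine/*
>> > > # restart; the node should pull a fresh CIB from the current DC
>> > > service corosync start
>> > > service pacemaker start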
>> > >
>> > > HTH
>> > >
>> > > Jake
>> > >
>> >
>> >
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org