[Pacemaker] Removed nodes showing back in status

Larry Brigman larry.brigman at gmail.com
Fri Jun 8 16:46:52 EDT 2012


Ping.  What can I do to help move this bug toward a fix?

On Wed, May 30, 2012 at 10:42 AM, Larry Brigman <larry.brigman at gmail.com> wrote:
> On Tue, May 29, 2012 at 3:08 PM, Larry Brigman <larry.brigman at gmail.com> wrote:
>> On Fri, May 25, 2012 at 3:40 PM, David Vossel <dvossel at redhat.com> wrote:
>>> ----- Original Message -----
>>>> From: "Larry Brigman" <larry.brigman at gmail.com>
>>>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>>> Sent: Friday, May 25, 2012 5:27:21 PM
>>>> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>>>
>>>> On Fri, May 25, 2012 at 9:59 AM, Larry Brigman
>>>> <larry.brigman at gmail.com> wrote:
>>>> > On Wed, May 16, 2012 at 1:53 PM, David Vossel <dvossel at redhat.com>
>>>> > wrote:
>>>> >> ----- Original Message -----
>>>> >>> From: "Larry Brigman" <larry.brigman at gmail.com>
>>>> >>> To: "The Pacemaker cluster resource manager"
>>>> >>> <pacemaker at oss.clusterlabs.org>
>>>> >>> Sent: Monday, May 14, 2012 4:59:55 PM
>>>> >>> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>>> >>>
>>>> >>> On Mon, May 14, 2012 at 2:13 PM, David Vossel
>>>> >>> <dvossel at redhat.com>
>>>> >>> wrote:
>>>> >>> > ----- Original Message -----
>>>> >>> >> From: "Larry Brigman" <larry.brigman at gmail.com>
>>>> >>> >> To: "The Pacemaker cluster resource manager"
>>>> >>> >> <pacemaker at oss.clusterlabs.org>
>>>> >>> >> Sent: Monday, May 14, 2012 1:30:22 PM
>>>> >>> >> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>>> >>> >>
>>>> >>> >> On Mon, May 14, 2012 at 9:54 AM, Larry Brigman
>>>> >>> >> <larry.brigman at gmail.com> wrote:
>>>> >>> >> > I have a 5 node cluster (but it could be any number of nodes, 3 or larger).
>>>> >>> >> > I am testing some scripts for node removal.
>>>> >>> >> > I remove a node from the cluster and everything looks correct from
>>>> >>> >> > crm status standpoint.
>>>> >>> >> > When I remove a second node, the first node that was removed now shows back
>>>> >>> >> > in the crm status as off-line.  I'm following the guidelines provided
>>>> >>> >> > in Pacemaker Explained docs.
>>>> >>> >> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-delete.html
>>>> >>> >> >
>>>> >>> >> > I believe this is a bug but want to put it out to the list to be sure.
>>>> >>> >> > Versions.
>>>> >>> >> > RHEL5.7 x86_64
>>>> >>> >> > corosync-1.4.2
>>>> >>> >> > openais-1.1.3
>>>> >>> >> > pacemaker-1.1.5
>>>> >>> >> >
>>>> >>> >> > Status after first node removed
>>>> >>> >> > [root at portland-3 ~]# crm status
>>>> >>> >> > ============
>>>> >>> >> > Last updated: Mon May 14 08:42:04 2012
>>>> >>> >> > Stack: openais
>>>> >>> >> > Current DC: portland-1 - partition with quorum
>>>> >>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>>> >>> >> > 4 Nodes configured, 4 expected votes
>>>> >>> >> > 0 Resources configured.
>>>> >>> >> > ============
>>>> >>> >> >
>>>> >>> >> > Online: [ portland-1 portland-2 portland-3 portland-4 ]
>>>> >>> >> >
>>>> >>> >> > Status after second node removed.
>>>> >>> >> > [root at portland-3 ~]# crm status
>>>> >>> >> > ============
>>>> >>> >> > Last updated: Mon May 14 08:42:45 2012
>>>> >>> >> > Stack: openais
>>>> >>> >> > Current DC: portland-1 - partition with quorum
>>>> >>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>>> >>> >> > 4 Nodes configured, 3 expected votes
>>>> >>> >> > 0 Resources configured.
>>>> >>> >> > ============
>>>> >>> >> >
>>>> >>> >> > Online: [ portland-1 portland-3 portland-4 ]
>>>> >>> >> > OFFLINE: [ portland-5 ]
>>>> >>> >> >
>>>> >>> >> > Both nodes were removed from the cluster from node 1.
>>>> >>> >>
>>>> >>> >> When I added a node back into the cluster the second node
>>>> >>> >> that was removed now shows as offline.
>>>> >>> >
>>>> >>> > The only time I've seen this sort of behavior is when I don't
>>>> >>> > completely shut down corosync and pacemaker on the node I'm
>>>> >>> > removing before I delete its configuration from the cib.  Are you
>>>> >>> > sure corosync and pacemaker are gone before you delete the node
>>>> >>> > from the cluster config?
>>>> >>>
>>>> >>> Well, I run service pacemaker stop and service corosync stop prior to
>>>> >>> doing the remove.  Since I am doing it all in a script it's possible that
>>>> >>> there is a race condition that I have just exposed, or the services are not
>>>> >>> fully down when the service script exits.
>>>> >>
>>>> >> Yep, if you are waiting for the service scripts to return I would
>>>> >> expect it to be safe to remove the nodes at that point.
>>>> >>
>>>> >>> BTW, I'm running pacemaker as its own process instead of being a
>>>> >>> child of corosync (if that makes a difference).
>>>> >>>
>>>> >>
>>>> >> This shouldn't matter.
>>>> >>
>>>> >> An hb_report of this will help us distinguish if this is a bug or
>>>> >> not.
>>>> > Bug opened with the hb and crm reports.
>>>> > https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2648
>>>> >
>>>>
>>>> I just tried something that seems to indicate that things are still
>>>> around somewhere in the cib.  I stopped and pacemaker.  This causes both
>>>> removed nodes to show back in pacemaker as offline.  Looks like the
>>>> Clusters from Scratch documentation for removing a node doesn't work
>>>> correctly.
>>>
>>> Interesting, thanks for generating the logs.  I'll look through them when I get a chance.
>>>
>>>> BTW, which is the best place to file the bugs?  Clusterlabs or
>>>> the Linux Foundation?
>>>
>>> We are tracking pacemaker issues here, http://bugs.clusterlabs.org/. Please re-locate the issue.
>>
>> Done: http://bugs.clusterlabs.org/show_bug.cgi?id=5068
>
> Looks like any cib transition will cause the removed node to re-appear.
>
> What are the next steps that I can do to assist?
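
A minimal sketch of how the symptom in the quoted status output could be
checked from a script. It assumes the "Online: [ ... ]" and "OFFLINE: [ ... ]"
line format shown in the crm status output earlier in this thread, and
Python 3. The function names are hypothetical, not part of any Pacemaker tool.

```python
import re

# Matches the node-list lines as they appear in the crm status output
# quoted in this thread, e.g. "Online: [ portland-1 portland-3 ]".
NODE_LIST = re.compile(r"\s*(Online|OFFLINE):\s*\[(.*)\]\s*$")


def parse_node_lists(status_text):
    """Return {'online': [...], 'offline': [...]} from crm status text."""
    nodes = {"online": [], "offline": []}
    for line in status_text.splitlines():
        match = NODE_LIST.match(line)
        if match:
            nodes[match.group(1).lower()].extend(match.group(2).split())
    return nodes


def reappeared(status_text, removed):
    """Names of removed nodes that still show up in the status output."""
    lists = parse_node_lists(status_text)
    seen = set(lists["online"]) | set(lists["offline"])
    return sorted(seen & set(removed))


# Node lists copied from the status taken after the second removal.
AFTER_SECOND_REMOVAL = """\
Online: [ portland-1 portland-3 portland-4 ]
OFFLINE: [ portland-5 ]
"""

print(reappeared(AFTER_SECOND_REMOVAL, ["portland-5", "portland-2"]))
```

With the quoted output this reports portland-5, the first node removed, which
matches what is described above.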

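To rule out the race mentioned in the quoted thread (the service scripts
returning before the daemons are fully down), a removal script could poll
until the processes are gone before it deletes the node. This is only a
sketch under assumptions: Python 3, pgrep available on the node, and the
process names passed in by the caller ("corosync" and "pacemakerd" in the
comment below are guesses, not confirmed for this setup).

```python
import subprocess
import time


def is_running(name):
    """True if a process with exactly this name exists (pgrep -x)."""
    result = subprocess.run(["pgrep", "-x", name], stdout=subprocess.DEVNULL)
    return result.returncode == 0


def wait_until_stopped(names, timeout=60.0, interval=1.0,
                       probe=is_running, sleep=time.sleep):
    """Poll until none of the named processes is running.

    Returns True once all are gone, False if the timeout expires first.
    probe and sleep are parameters so the loop can be exercised without
    touching real processes.
    """
    deadline = time.monotonic() + timeout
    while True:
        still_up = [name for name in names if probe(name)]
        if not still_up:
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Hypothetical use on the node being removed, after the two
# "service ... stop" calls have returned:
#
#   if not wait_until_stopped(["corosync", "pacemakerd"]):
#       raise SystemExit("daemons still running, not removing the node")
```

Only after this returns True would the script go on to the removal steps
from the documentation.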


