[Pacemaker] Removed nodes showing back in status

Fri May 25 22:27:21 UTC 2012

On Fri, May 25, 2012 at 9:59 AM, Larry Brigman <larry.brigman at gmail.com> wrote:
> On Wed, May 16, 2012 at 1:53 PM, David Vossel <dvossel at redhat.com> wrote:
>> ----- Original Message -----
>>> From: "Larry Brigman" <larry.brigman at gmail.com>
>>> To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org>
>>> Sent: Monday, May 14, 2012 4:59:55 PM
>>> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>>
>>> On Mon, May 14, 2012 at 2:13 PM, David Vossel <dvossel at redhat.com>
>>> wrote:
>>> > ----- Original Message -----
>>> >> From: "Larry Brigman" <larry.brigman at gmail.com>
>>> >> To: "The Pacemaker cluster resource manager"
>>> >> <pacemaker at oss.clusterlabs.org>
>>> >> Sent: Monday, May 14, 2012 1:30:22 PM
>>> >> Subject: Re: [Pacemaker] Removed nodes showing back in status
>>> >>
>>> >> On Mon, May 14, 2012 at 9:54 AM, Larry Brigman
>>> >> <larry.brigman at gmail.com> wrote:
>>> >> > I have a 5 node cluster (but it could be any number of nodes, 3
>>> >> > or
>>> >> > larger).
>>> >> > I am testing some scripts for node removal.
>>> >> > I remove a node from the cluster and everything looks correct
>>> >> > from
>>> >> > crm
>>> >> > status standpoint.
>>> >> > When I remove a second node, the first node that was removed now
>>> >> > shows back
>>> >> > in the crm status as off-line.  I'm following the guidelines
>>> >> > provided
>>> >> > in Pacemaker Explained docs.
>>> >> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Pacemaker_Explained/s-node-delete.html
>>> >> >
>>> >> > I believe this is a bug but want to put it out to the list to be
>>> >> > sure.
>>> >> > Versions.
>>> >> > RHEL5.7 x86_64
>>> >> > corosync-1.4.2
>>> >> > openais-1.1.3
>>> >> > pacemaker-1.1.5
>>> >> >
>>> >> > Status after first node removed
>>> >> > [root at portland-3 ~]# crm status
>>> >> > ============
>>> >> > Last updated: Mon May 14 08:42:04 2012
>>> >> > Stack: openais
>>> >> > Current DC: portland-1 - partition with quorum
>>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>> >> > 4 Nodes configured, 4 expected votes
>>> >> > 0 Resources configured.
>>> >> > ============
>>> >> >
>>> >> > Online: [ portland-1 portland-2 portland-3 portland-4 ]
>>> >> >
>>> >> > Status after second node removed.
>>> >> > [root at portland-3 ~]# crm status
>>> >> > ============
>>> >> > Last updated: Mon May 14 08:42:45 2012
>>> >> > Stack: openais
>>> >> > Current DC: portland-1 - partition with quorum
>>> >> > Version: 1.1.5-1.3.sme-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
>>> >> > 4 Nodes configured, 3 expected votes
>>> >> > 0 Resources configured.
>>> >> > ============
>>> >> >
>>> >> > Online: [ portland-1 portland-3 portland-4 ]
>>> >> > OFFLINE: [ portland-5 ]
>>> >> >
>>> >> > Both nodes were removed from the cluster from node 1.
>>> >>
>>> >> When I added a node back into the cluster the second node
>>> >> that was removed now shows as offline.
>>> >
>>> > The only time I've seen this sort of behavior is when I don't
>>> > completely shut down corosync and pacemaker on the node I'm
>>> > removing before I delete its configuration from the cib.  Are you
>>> > sure corosync and pacemaker are gone before you delete the node
>>> > from the cluster config?
>>>
>>> Well, I run service pacemaker stop and service corosync stop prior to
>>> doing
>>> the remove.  Since I am doing it all in a script it's possible that
>>> there
>>> is a race condition that I have just exposed or the services are not
>>> fully down
>>> when the service script exits.
>>
>> Yep, if you are waiting for the service scripts to return I would expect it to be safe to remove the nodes at that point.
>>
>>> BTW, I'm running pacemaker as its own process instead of being a
>>> child of
>>> corosync (if that makes a difference).
>>>
>>
>> This shouldn't matter.
>>
>> An hb_report of this will help us distinguish if this is a bug or not.
> Bug opened with the hb and crm reports.
> https://developerbugs.linuxfoundation.org/show_bug.cgi?id=2648
>

I just tried something that seems to indicate the removed nodes are still
around somewhere in the cib.  I stopped and restarted pacemaker.  This causes
both removed nodes to show back up in pacemaker as offline.  It looks like the
Clusters from Scratch documentation for removing a node doesn't work correctly.
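
For anyone who wants to check for this in a test script, here is a minimal
sketch (not part of my actual removal scripts) that scans crm status text for
node names that should already be gone.  The function names
parse_node_lists and resurrected_nodes are made up for illustration, and the
parsing assumes the "Online: [ ... ]" / "OFFLINE: [ ... ]" layout shown in the
status output quoted above; other pacemaker versions may format it differently.

```python
import re

def parse_node_lists(status_text):
    """Return (online, offline) sets of node names found in crm status text.

    Assumes lines shaped like 'Online: [ a b ]' and 'OFFLINE: [ c ]'.
    """
    online, offline = set(), set()
    for line in status_text.splitlines():
        m = re.match(r"\s*(Online|OFFLINE):\s*\[(.*)\]\s*$", line)
        if not m:
            continue
        names = set(m.group(2).split())
        if m.group(1) == "Online":
            online |= names
        else:
            offline |= names
    return online, offline

def resurrected_nodes(status_text, removed):
    """Return the sorted removed-node names that still appear in the status."""
    online, offline = parse_node_lists(status_text)
    return sorted((online | offline) & set(removed))

# Node lists taken from the second status output quoted above.
SAMPLE = """\
Online: [ portland-1 portland-3 portland-4 ]
OFFLINE: [ portland-5 ]
"""

print(resurrected_nodes(SAMPLE, ["portland-2", "portland-5"]))
```

In a real run the text would come from the crm status command itself rather
than a pasted sample; a non-empty result means a removed node has come back.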

BTW, which is the best place to file bugs: ClusterLabs or the Linux Foundation?