[Pacemaker] Periodically appear non-existent nodes

Thu Apr 19 11:58:09 EDT 2012

----- Original Message -----
> From: "Vladislav Bogdanov" <bubble at hoster-ok.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Thursday, April 19, 2012 4:06:33 AM
> Subject: Re: [Pacemaker] Periodically appear non-existent nodes
> 
> 19.04.2012 11:24, Andreas Kurz wrote:
> > On 04/18/2012 11:46 PM, ruslan usifov wrote:
> >>
> >>
> >> 2012/4/18 Andreas Kurz <andreas at hastexo.com>
> >>
> >>     On 04/17/2012 09:31 PM, ruslan usifov wrote:
> >>     >
> >>     >
> >>     > 2012/4/17 Proskurin Kirill <k.proskurin at corp.mail.ru>
> >>     >
> >>     >     On 04/17/2012 03:46 PM, ruslan usifov wrote:
> >>     >
> >>     >         2012/4/17 Andreas Kurz <andreas at hastexo.com>
> >>     >
> >>     >
> >>     >            On 04/14/2012 11:14 PM, ruslan usifov wrote:
> >>     >             > Hello
> >>     >             >
> >>     >             > I removed 2 nodes from the cluster with the
> >>     >             > following sequence:
> >>     >             >
> >>     >             > crm_node --force -R <id of node1>
> >>     >             > crm_node --force -R <id of node2>
> >>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node1"/>'
> >>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node1"/>'
> >>     >             > cibadmin --delete --obj_type nodes --crm_xml '<node uname="node2"/>'
> >>     >             > cibadmin --delete --obj_type status --crm_xml '<node_state uname="node2"/>'
> >>     >             >
> >>     >             >
> >>     >             > After this the nodes were deleted, but if, for
> >>     >             > example, I restart (reboot) one of the existing
> >>     >             > nodes in the working cluster, the deleted nodes
> >>     >             > appear again in OFFLINE state.
> >>     >
> >>     >
> >>     >     I had this problem some time ago.
> >>     >     I "solved" it with something like this:
> >>     >
> >>     >     crm node delete NODENAME
> >>     >     crm_node --force --remove NODENAME
> >>     >     cibadmin --delete --obj_type nodes --crm_xml '<node uname="NODENAME"/>'
> >>     >     cibadmin --delete --obj_type status --crm_xml '<node_state uname="NODENAME"/>'
> >>     >
> >>     >     --
> >>     >
> >>     >
> >>     > I do the same, but sometimes after a cluster reconfiguration
> >>     > (a node failed due to a power supply failure) the removed
> >>     > nodes appear again, and this has happened 3-4 times.
> >>
> >>     And do you see the same behavior if you switch your cluster
> >>     into maintenance-mode (to avoid service downtime) and
> >>     stop/start pacemaker and corosync completely?
> >>
> >>
> >> We will have a maintenance window this Friday (20.04.2012), so
> >> after that I can report more info.
> > 
> > Of course, that is the safest option ... though you won't have a
> > service
> > downtime if you enable maintenance-mode prior to cluster restart.
> 
> Unless you are using DLM (CLVM, GFS2, OCFS2). Then you should not
> stop corosync - dlm_controld uses CPG.
> 
> And, DLM may use pacemaker parts for fencing (cib, attrd, stonith,
> depending on version).
> 
> > 
> >>
> >> PS: I had a similar situation on another cluster some time ago;
> >> there I fully restarted the cluster and the problem reproduced.
> >> But after some time (about 1-2 weeks) the non-existent nodes
> >> stopped appearing.
> > 
> > Now that is really strange ... if that happens again, the
> > corosync/pacemaker log files would be really interesting to have a
> > look at.
> 
> I recall that this has been a known issue for a rather long time.
> One needs to do a full (not rolling) restart to make a node fully
> disappear.
> I checked this again not so long ago, and yes, node deletion does not
> work with the current master branch (or very close to it) - the node
> appears again after pacemaker restart on any other node.
> 
> Maybe it is because of the lrmd cache, like with failed actions? It
> looks very similar to that.

Looks similar, but it shouldn't be related.

-- Vossel
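
For reference, the four-step removal sequence quoted above (the
NODENAME variant) can be wrapped in a small shell helper. This is only
a sketch: the names DRY_RUN, run and remove_node are made up for this
example and are not part of Pacemaker, and the commands are copied from
the thread rather than verified against any particular version. By
default it only prints the commands. As reported above, a removed node
may still reappear until the cluster gets a full (not rolling) restart.

```shell
#!/bin/sh
# Sketch only: DRY_RUN, run and remove_node are hypothetical helper names.
# DRY_RUN=1 (the default) prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        # Print the command line; shell quoting is not reproduced here.
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Remove one node by name, using the commands quoted in the thread.
remove_node() {
    node=$1
    run crm node delete "$node"
    run crm_node --force --remove "$node"
    run cibadmin --delete --obj_type nodes --crm_xml "<node uname=\"$node\"/>"
    run cibadmin --delete --obj_type status --crm_xml "<node_state uname=\"$node\"/>"
}
```

Calling "remove_node node1" prints the four commands for node1; setting
DRY_RUN=0 first would execute them on a cluster node instead.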