[ClusterLabs] How to cancel a fencing request?

Ken Gaillot kgaillot at redhat.com
Tue Apr 3 17:59:21 EDT 2018


On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 02 Apr 2018 09:02:24 -0500
> Ken Gaillot <kgaillot at redhat.com> wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
> 
> [...]
> > > > In a two-node cluster you can set pcmk_delay_max so that both
> > > > nodes do not attempt fencing simultaneously.
> > > 
> > > I'm not sure I understand the doc correctly with regard to this
> > > property. Does pcmk_delay_max delay the request itself or the
> > > execution of the request?
> > > 
> > > In other words, is it:
> > > 
> > >   delay -> fence query -> fencing action
> > > 
> > > or 
> > > 
> > >   fence query -> delay -> fence action
> > > 
> > > ?
> > > 
> > > The first definition would solve this issue, but not the second.
> > > As I understand it, as soon as the fence query has been sent, the
> > > node status is "UNCLEAN (online)".
> > 
> > The latter -- you're correct, the node is already unclean by that
> > time. Since the stop did not succeed, the node must be fenced to
> > continue safely.
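> > 
> > For reference, pcmk_delay_max is set on the fence device itself.
> > Roughly like this with pcs (the device name is just a placeholder,
> > and crmsh has an equivalent):
> > 
> >     # wait a random 0-15s before executing the requested fence action
> >     pcs stonith update my-vmware-fence pcmk_delay_max=15s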
> 
> Thank you for this clarification.
> 
> Do you want a patch to add this clarification to the documentation?

Sure, it never hurts :)

> 
> > > > > The first node did, but no fence agent (FA) was then able to
> > > > > fence the second one. So the node stayed DC and was reported
> > > > > as "UNCLEAN (online)".
> > > > > 
> > > > > We were able to fix the original resource problem, but not to
> > > > > avoid the useless second node fencing.
> > > > > 
> > > > > My questions are:
> > > > > 
> > > > > 1. is it possible to cancel the fencing request?
> > > > > 2. is it possible to reset the node status to "online"?
> > > > 
> > > > Not that I'm aware of.  
> > > 
> > > Argh!
> > > 
> > > ++  
> > 
> > You could fix the problem with the stopped service manually, then
> > run "stonith_admin --confirm=<NODENAME>" (or higher-level tool
> > equivalent). That tells the cluster that you took care of the issue
> > yourself, so fencing can be considered complete.
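> > 
> > For example, with the affected node's name substituted in:
> > 
> >     # tell the cluster the node was dealt with manually
> >     stonith_admin --confirm=<NODENAME>
> > 
> > (If I recall correctly, pcs has "pcs stonith confirm <NODENAME>" as
> > a wrapper for this.)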
> 
> Oh, OK. I was wondering if it could help.
> 
> For the complete story, while I was working on this cluster, we first
> tried to "unfence" the node using "stonith_admin --unfence
> <nodename>"... and it actually rebooted the node (using
> fence_vmware_soap) without cleaning its status??
> 
> ...So we actually cleaned the status using "--confirm" after the
> complete reboot.
> 
> Thank you for this clarification again.
> 
> > The catch there is that the cluster will assume you stopped the
> > node, and all services on it are stopped. That could potentially
> > cause some headaches if it's not true. I'm guessing that if you
> > unmanaged all the resources on it first, then confirmed fencing,
> > the cluster would detect everything properly, then you could
> > re-manage.
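> > 
> > Something along these lines (untested, names are placeholders):
> > 
> >     # take the resources out of cluster management first
> >     pcs resource unmanage my-resource      # repeat for each resource
> >     # then tell the cluster the fencing was handled manually
> >     stonith_admin --confirm=<NODENAME>
> >     # once the cluster has re-detected the real state, manage again
> >     pcs resource manage my-resource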
> 
> Good to know. Thanks again.
> 
-- 
Ken Gaillot <kgaillot at redhat.com>


