[ClusterLabs] How to cancel a fencing request?

Jehan-Guillaume de Rorthais jgdr at dalibo.com
Tue Apr 3 15:33:53 EDT 2018


On Mon, 02 Apr 2018 09:02:24 -0500
Ken Gaillot <kgaillot at redhat.com> wrote:
> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
> > On Sun, 1 Apr 2018 09:01:15 +0300
> > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
[...]
> > > In a two-node cluster you can set pcmk_delay_max so that both nodes do
> > > not attempt fencing simultaneously.
> > 
> > I'm not sure I understand the doc correctly with regard to this property.
> > Does pcmk_delay_max delay the request itself or the execution of the
> > request?
> > 
> > In other words, is it:
> > 
> >   delay -> fence query -> fencing action
> > 
> > or 
> > 
> >   fence query -> delay -> fence action
> > 
> > ?
> > 
> > The first definition would solve this issue, but not the second. As I
> > understand it, as soon as the fence query has been sent, the node status
> > is "UNCLEAN (online)".
> 
> The latter -- you're correct, the node is already unclean by that time.
> Since the stop did not succeed, the node must be fenced to continue
> safely.

Thank you for this clarification.

Would you like a patch adding this clarification to the documentation?
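
For the archives, setting such a delay on a stonith device should look
something like this (the device name "vmfence" and the 15s value are just
examples here, and the crm shell has an equivalent):

  # add a random delay of up to 15s before the fence action is executed,
  # so both nodes do not shoot each other at the same time
  pcs stonith update vmfence pcmk_delay_max=15s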

> > > > The first node did, but no fence agent was then able to fence the
> > > > second one. So the node stayed DC and was reported as "UNCLEAN
> > > > (online)".
> > > > 
> > > > We were able to fix the original resource problem, but not to avoid
> > > > the useless fencing of the second node.
> > > > 
> > > > My questions are:
> > > > 
> > > > 1. is it possible to cancel the fencing request?
> > > > 2. is it possible to reset the node status to "online"?
> > > 
> > > Not that I'm aware of.  
> > 
> > Argh!
> > 
> > ++  
> 
> You could fix the problem with the stopped service manually, then run
> "stonith_admin --confirm=<NODENAME>" (or higher-level tool equivalent).
> That tells the cluster that you took care of the issue yourself, so
> fencing can be considered complete.

Oh, OK. I was wondering if it could help.

For the complete story, while I was working on this cluster, we first tried to
"unfence" the node using "stonith_admin --unfence <nodename>"... and it
actually rebooted the node (using fence_vmware_soap) without cleaning its
status??

... So we ended up cleaning the status with "--confirm" after the node had
completely rebooted.
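
For anyone hitting the same situation and reading the archives, the sequence
should roughly be the following (the node name "srv2" is just an example):

  # first fix the underlying service problem by hand, then tell the
  # cluster the fencing of the node can be considered complete
  stonith_admin --confirm=srv2

  # or, I suppose, with the higher-level tool equivalent
  pcs stonith confirm srv2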

Thank you again for this clarification.

> The catch there is that the cluster will assume you stopped the node,
> and all services on it are stopped. That could potentially cause some
> headaches if it's not true. I'm guessing that if you unmanaged all the
> resources on it first, then confirmed fencing, the cluster would detect
> everything properly, then you could re-manage.

Good to know. Thanks again.
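
If someone wants to try that path, I suppose it would look something like the
following (resource and node names are placeholders):

  # stop managing the resources so the cluster does not touch them
  pcs resource unmanage my_rsc

  # tell the cluster the fencing of the node can be considered complete
  stonith_admin --confirm=srv2

  # once the cluster has re-probed the resources, manage them again
  pcs resource manage my_rsc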



