[ClusterLabs] How to cancel a fencing request?
Ken Gaillot
kgaillot at redhat.com
Tue Apr 3 17:59:21 EDT 2018
On Tue, 2018-04-03 at 21:33 +0200, Jehan-Guillaume de Rorthais wrote:
> On Mon, 02 Apr 2018 09:02:24 -0500
> Ken Gaillot <kgaillot at redhat.com> wrote:
> > On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais
> > wrote:
> > > On Sun, 1 Apr 2018 09:01:15 +0300
> > > Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>
> [...]
> > > > In a two-node cluster you can set pcmk_delay_max so that both
> > > > nodes do not attempt fencing simultaneously.
> > >
> > > I'm not sure I understand the documentation correctly with regard
> > > to this property. Does pcmk_delay_max delay the request itself or
> > > the execution of the request?
> > >
> > > In other words, is it:
> > >
> > > delay -> fence query -> fencing action
> > >
> > > or
> > >
> > > fence query -> delay -> fence action
> > >
> > > ?
> > >
> > > The first definition would solve this issue, but not the second.
> > > As I understand it, as soon as the fence query has been sent, the
> > > node status is "UNCLEAN (online)".
> >
> > The latter -- you're correct, the node is already unclean by that
> > time. Since the stop did not succeed, the node must be fenced to
> > continue safely.
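For anyone configuring this: pcmk_delay_max is set on the fence device
itself. A minimal sketch with pcs -- the device name, credentials and
delay value below are only examples, adjust them for your agent:

    # random delay of up to 15s before the device executes a fencing action
    pcs stonith create vmfence fence_vmware_soap \
        ipaddr=vcenter.example.com login=fenceuser passwd=secret \
        pcmk_host_list="node1 node2" pcmk_delay_max=15s

    # or, on an existing device:
    pcs stonith update vmfence pcmk_delay_max=15s

Either way, as noted above, the delay happens before the fencing action
is executed, not before the node is marked unclean.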
>
> Thank you for this clarification.
>
> Do you want a patch to add this clarification to the documentation?
Sure, it never hurts :)
>
> > > > > The first node did, but no FA was then able to fence the
> > > > > second one. So the node stayed DC and was reported as
> > > > > "UNCLEAN (online)".
> > > > >
> > > > > We were able to fix the original resource problem, but not to
> > > > > avoid the useless second node fencing.
> > > > >
> > > > > My questions are:
> > > > >
> > > > > 1. is it possible to cancel the fencing request
> > > > > 2. is it possible to reset the node status to "online"?
> > > >
> > > > Not that I'm aware of.
> > >
> > > Argh!
> > >
> > > ++
> >
> > You could fix the problem with the stopped service manually, then
> > run "stonith_admin --confirm=<NODENAME>" (or higher-level tool
> > equivalent). That tells the cluster that you took care of the issue
> > yourself, so fencing can be considered complete.
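To make that concrete, after fixing the failed service by hand the
acknowledgement itself is just (the node name is a placeholder):

    # only do this once you're sure the node's services are really down,
    # or otherwise safe -- it tells the cluster fencing is complete
    stonith_admin --confirm=<NODENAME>

    # the pcs equivalent, if you prefer the higher-level tool
    pcs stonith confirm <NODENAME>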
>
> Oh, OK. I was wondering if it could help.
>
> For the complete story, while I was working on this cluster, we first
> tried to "unfence" the node using "stonith_admin --unfence <nodename>"
> ...and it actually rebooted the node (using fence_vmware_soap) without
> cleaning its status??
>
> ...So we actually cleaned the status using "--confirm" after the
> complete reboot.
>
> Thank you for this clarification again.
>
> > The catch there is that the cluster will assume you stopped the
> > node, and all services on it are stopped. That could potentially
> > cause some headaches if it's not true. I'm guessing that if you
> > unmanaged all the resources on it first, then confirmed fencing, the
> > cluster would detect everything properly, then you could re-manage.
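A rough, untested sketch of that sequence -- resource and node names
are placeholders:

    # keep the cluster from acting on the resources while you sort things out
    pcs resource unmanage <resource>     # repeat for each resource on the node

    # tell the cluster the node has been dealt with manually
    stonith_admin --confirm=<NODENAME>

    # once probes have run and the status looks sane again
    pcs resource manage <resource>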
>
> Good to know. Thanks again.
>
--
Ken Gaillot <kgaillot at redhat.com>