[ClusterLabs] How to cancel a fencing request?
Klaus Wenninger
kwenning at redhat.com
Tue Apr 3 01:36:31 EDT 2018
On 04/02/2018 04:02 PM, Ken Gaillot wrote:
> On Mon, 2018-04-02 at 10:54 +0200, Jehan-Guillaume de Rorthais wrote:
>> On Sun, 1 Apr 2018 09:01:15 +0300
>> Andrei Borzenkov <arvidjaar at gmail.com> wrote:
>>
>>> 31.03.2018 23:29, Jehan-Guillaume de Rorthais пишет:
>>>> Hi all,
>>>>
>>>> I experienced a problem in a two-node cluster. It has one fencing
>>>> agent (FA) per node, with location constraints so that each FA
>>>> avoids the node it is supposed to fence.
>>> If you mean stonith resources: as far as I know, location
>>> constraints do not affect stonith operations; they only change
>>> where the monitor action is performed.
>> Sure.
>>
>>> You can create two stonith resources and declare that each can
>>> fence only a single node, but that is resource configuration, not
>>> a location constraint. Showing your configuration would be helpful
>>> to avoid guessing.
>> True, I should have done that. A config is worth a thousand words :)
>>
>> crm conf<<EOC
>>
>> primitive fence_vm_srv1 stonith:fence_virsh \
>>     params pcmk_host_check="static-list" pcmk_host_list="srv1" \
>>         ipaddr="192.168.2.1" login="<user>" \
>>         identity_file="/root/.ssh/id_rsa" \
>>         port="srv1-d8" action="off" \
>>     op monitor interval=10s
>>
>> location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1
>>
>> primitive fence_vm_srv2 stonith:fence_virsh \
>>     params pcmk_host_check="static-list" pcmk_host_list="srv2" \
>>         ipaddr="192.168.2.1" login="<user>" \
>>         identity_file="/root/.ssh/id_rsa" \
>>         port="srv2-d8" action="off" \
>>     op monitor interval=10s
>>
>> location fence_vm_srv2-avoids-srv2 fence_vm_srv2 -inf: srv2
>>
>> EOC
>>
-inf constraints like that should effectively prevent
stonith actions from being executed on those nodes.
There are, however, a few issues with location constraints
and stonith devices.
When stonithd brings up the devices from the CIB, it
runs the parts of the pengine that fully evaluate these
constraints, and it disables a stonith device if the
resource is unrunnable on that node.
But this evaluation is not retriggered for location
constraints with attributes or other content that can
change dynamically. So one has to stick with constraints
as simple and static as those in the example above.
Regarding adding/removing location constraints dynamically:
I remember a bug, fixed around 1.1.18, that led to improper
handling and actual use of stonith devices that were
disabled on, or banned from, certain nodes.
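To illustrate the distinction (the attribute name below is purely
hypothetical), a plain static ban like those in the configuration
above is evaluated reliably, while a rule-based ban referencing a
node attribute is the kind stonithd will not re-evaluate when the
attribute later changes:

```
# safe: simple, static ban, as in the configuration above
location fence_vm_srv1-avoids-srv1 fence_vm_srv1 -inf: srv1

# risky: rule-based ban on a node attribute -- stonithd only
# evaluates this when it (re)reads the device from the CIB,
# not when the attribute changes afterwards
location fence_vm_srv1-ban-maint fence_vm_srv1 \
    rule -inf: maintenance eq on
```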
Regards,
Klaus
>>>> During some tests, an ms resource raised an error during the stop
>>>> action on both nodes. So both nodes were supposed to be fenced.
>>> In a two-node cluster you can set pcmk_delay_max so that both
>>> nodes do not attempt fencing simultaneously.
>> I'm not sure I understand the documentation correctly with regard
>> to this property. Does pcmk_delay_max delay the request itself, or
>> the execution of the request?
>>
>> In other words, is it:
>>
>> delay -> fence query -> fencing action
>>
>> or
>>
>> fence query -> delay -> fence action
>>
>> ?
>>
>> The first definition would solve this issue, but not the second. As I
>> understand it, as soon as the fence query has been sent, the node
>> status is
>> "UNCLEAN (online)".
> The latter -- you're correct, the node is already unclean by that time.
> Since the stop did not succeed, the node must be fenced to continue
> safely.
Well, pcmk_delay_base/max are made for the case
where both nodes in a two-node cluster lose contact
and each sees the other as unclean.
Once the loser gets fenced, its view of the partner
node becomes irrelevant.
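Applied to the configuration earlier in the thread, that would mean
adding a random delay to the fencing devices, e.g. (the value is only
an example; pcmk_delay_max bounds a random delay applied before the
fence action is executed):

```
primitive fence_vm_srv2 stonith:fence_virsh \
    params pcmk_host_check="static-list" pcmk_host_list="srv2" \
        pcmk_delay_max="10s" \
        ipaddr="192.168.2.1" login="<user>" \
        identity_file="/root/.ssh/id_rsa" \
        port="srv2-d8" action="off" \
    op monitor interval=10s
```

Alternatively, pcmk_delay_base can give one device a fixed head start
so that a predictable node survives the race.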
>>>> The first node was fenced, but no FA was then able to fence the
>>>> second one. So that node stayed DC and was reported as
>>>> "UNCLEAN (online)".
>>>>
>>>> We were able to fix the original resource problem, but not to
>>>> avoid the useless fencing of the second node.
>>>>
>>>> My questions are:
>>>>
>>>> 1. is it possible to cancel the fencing request?
>>>> 2. is it possible to reset the node status to "online"?
>>> Not that I'm aware of.
>> Argh!
>>
>> ++
> You could fix the problem with the stopped service manually, then run
> "stonith_admin --confirm=<NODENAME>" (or higher-level tool equivalent).
> That tells the cluster that you took care of the issue yourself, so
> fencing can be considered complete.
>
> The catch there is that the cluster will assume you stopped the node,
> and all services on it are stopped. That could potentially cause some
> headaches if it's not true. I'm guessing that if you unmanaged all the
> resources on it first, then confirmed fencing, the cluster would detect
> everything properly, then you could re-manage.
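A possible recovery sequence along those lines (resource and node
names are examples only; adapt them to your cluster):

```
# keep the cluster from acting on the affected resource
crm resource unmanage my_ms_resource

# fix the underlying problem by hand, then tell the cluster that the
# fencing of srv2 can be considered complete
stonith_admin --confirm=srv2

# once the cluster has re-detected the resource state, manage it again
crm resource manage my_ms_resource
```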