[ClusterLabs] Failing operations immediately when node is known to be down
kgaillot at redhat.com
Fri Apr 13 10:35:34 EDT 2018
On Tue, 2018-04-10 at 12:56 -0500, Ryan Thomas wrote:
> I’m trying to implement a HA solution which recovers very quickly
> when a node fails. In my configuration, when I reboot a node, I see
> in the logs that pacemaker realizes the node is down, and decides to
> move all resources to the surviving node. To do this, it initiates a
> ‘stop’ operation on each of the resources to perform the move. The
> ‘stop’ fails as expected after 20s (the default action timeout).
> However, in this case, with the node known to be down, I’d like to
> avoid this 20 second delay. The node is known to be down, so any
> operations sent to the node will fail. It would be nice if
> operations sent to a down node would immediately fail, thus reducing
> the time it takes the resource to be started on the surviving node.
> I do not want to reduce the timeout for the operation, because the
> timeout is sensible for when a resource moves due to a non-node-
> failure. Is there a way to accomplish this?
> Thanks for your help.
How are you rebooting -- cleanly (normal shutdown) or simulating a
failure (e.g. power button)?
In a normal shutdown, pacemaker will move all resources off the node
before it shuts down. These operations shouldn't fail, because the node
isn't down yet.
When a node fails, corosync should detect this and notify pacemaker.
Pacemaker will not try to execute any operations on a failed node.
Instead, it will fence it.
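To make the difference between the two cases concrete, here is a toy
Python model of the scheduling decision described above. It is only a
sketch and not Pacemaker code: NodeState, plan_recovery, and the node
names "leaving-node" and "surviving-node" are hypothetical names made
up for illustration.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "online"
    SHUTTING_DOWN = "shutting_down"  # clean shutdown requested, node still up
    LOST = "lost"                    # membership layer reports the node gone


def plan_recovery(node_state, resources):
    """Return an ordered list of (action, target, node) tuples.

    Toy model only: a clean shutdown stops resources on the node while
    it is still reachable; a lost node is fenced and no operations are
    sent to it.
    """
    if node_state is NodeState.ONLINE:
        return []
    if node_state is NodeState.SHUTTING_DOWN:
        # The node is still up, so the stops are expected to succeed.
        plan = [("stop", r, "leaving-node") for r in resources]
    else:
        # The node is gone: fence it instead of sending it any operations.
        plan = [("fence", "leaving-node", None)]
    # In both cases the resources are then started on the survivor.
    plan += [("start", r, "surviving-node") for r in resources]
    return plan


if __name__ == "__main__":
    for state in (NodeState.SHUTTING_DOWN, NodeState.LOST):
        print(state.value, plan_recovery(state, ["vip", "db"]))
```

In this model the plan for a lost node contains no 'stop' at all, so
a 'stop' that runs into its timeout would point to the node not being
treated as failed at that moment.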
What log messages do you see from corosync and pacemaker indicating
that the node is down? Do you have fencing configured and tested?
Ken Gaillot <kgaillot at redhat.com>