[Pacemaker] Problems with SBD fencing

Jan Christian Kaldestad janck76 at gmail.com
Wed Aug 21 12:15:39 EDT 2013


In my case I should mention that stonithing occasionally works when the SBD
resource is defined on one node only, but not often, and unfortunately I
can't find any pattern for when it works and when it fails. What I'm curious
about are the following lines in the log file:
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node
slesha1n2i-u will be fenced because the node is no longer part of the
cluster
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning:
determine_online_status: Node slesha1n2i-u is unclean
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action:
Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling
Node slesha1n2i-u for STONITH
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move
 stonith_sbd   (Started slesha1n2i-u -> slesha1n1i-u)
 ...
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: te_fence_node:
Executing reboot fencing operation (24) on slesha1n2i-u (timeout=60000)
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice: handle_request:
Client crmd.8916.3144546f wants to fence (reboot) 'slesha1n2i-u' with
device '(any)'
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:   notice:
initiate_remote_stonith_op: Initiating remote operation reboot for
slesha1n2i-u: 8c00ff7b-2986-4b2a-8b4a-760e8346349b (0)
 Aug  1 12:00:01 slesha1n1i-u stonith-ng[8912]:    error: remote_op_done:
Operation reboot of slesha1n2i-u by slesha1n1i-u for
crmd.8916 at slesha1n1i-u.8c00ff7b: No route to host
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice:
tengine_stonith_callback: Stonith operation
3/24:3:0:8a0f32b2-f91c-4cdf-9cee-1ba9b6e187ab: No route to host (-113)
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice:
tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No
route to host): aborting transition.
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: tengine_stonith_notify:
Peer slesha1n2i-u was not terminated (st_notify_fence) by slesha1n1i-u for
slesha1n1i-u: No route to host (ref=8c00ff7b-2986-4b2a-8b4a-760e8346349b)
by client crmd.8916
 Aug  1 12:00:01 slesha1n1i-u crmd[8916]:   notice: run_graph: Transition 3
(Complete=1, Pending=0, Fired=0, Skipped=5, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-warn-15.bz2): Stopped
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: unpack_config: On
loss of CCM Quorum: Ignore
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: pe_fence_node: Node
slesha1n2i-u will be fenced because the node is no longer part of the
cluster
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning:
determine_online_status: Node slesha1n2i-u is unclean
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: custom_action:
Action stonith_sbd_stop_0 on slesha1n2i-u is unrunnable (offline)
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:  warning: stage6: Scheduling
Node slesha1n2i-u for STONITH
 Aug  1 12:00:01 slesha1n1i-u pengine[8915]:   notice: LogActions: Move
 stonith_sbd   (Started slesha1n2i-u -> slesha1n1i-u)
 ...
 Aug  1 12:00:02 slesha1n1i-u crmd[8916]:   notice: too_many_st_failures:
Too many failures to fence slesha1n2i-u (11), giving up


What does this mean?
tengine_stonith_callback: Stonith operation 3 for slesha1n2i-u failed (No
route to host): aborting transition.

Of course there is no route to the other host, since the network interface
on the other node is down. But shouldn't the SBD stonith operation be
independent of the network connection altogether?


I have also been testing another case where I define the SBD resource on
both nodes (which, as I understand it, is not recommended). In that case
stonithing always works, so SBD messaging itself must be working as it
should. I also tested fencing the other node directly with the sbd command,
and that always works too. So I'm still confused about why SBD stonithing
does not work when the resource is defined on one node only.
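
For reference, the kind of manual test I mean looks roughly like this (the
device path is only a placeholder; substitute your own SBD_DEVICE):

    # dump the on-disk header, including the msgwait timeout
    sbd -d /dev/disk/by-id/<your-sbd-device> dump
    # list the slots and any pending messages
    sbd -d /dev/disk/by-id/<your-sbd-device> list
    # write a reset message into the other node's slot
    sbd -d /dev/disk/by-id/<your-sbd-device> message slesha1n2i-u reset

As long as the sbd daemon on the target node is watching the device, it
should then self-fence within msgwait seconds, regardless of the network.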


Regards
Jan C


On Wed, Aug 21, 2013 at 4:27 PM, Lars Marowsky-Bree <lmb at suse.com> wrote:

> On 2013-08-20T08:52:00, "Angel L. Mateo" <amateo at um.es> wrote:
>
> Sorry, I was on vacation for a few weeks, thus only chiming in now.
>
> Instead of the Linux-HA Wiki page, please look here for the
> documentation: https://github.com/l-mb/sbd/blob/master/man/sbd.8.pod
>
> (Or, on a system with sbd installed, simply type "man sbd")
>
> The most common problems for fencing failures with SBD:
>
> - Pacemaker's stonith-timeout is not long enough to account for sbd's
>   msgwait. It needs to be at least 50% larger. (Pacemaker uses some of
>   the stonith-timeout for the look-up phase, and it isn't available for
>   the actual fence request.)
>
> - The storage is not truly shared.
>
>   Then the node can't actually "see" the other, and will not be able to
>   find the messaging slot. Hence, fencing will fail.
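
To illustrate the msgwait point above (the numbers here are only an
example): if "sbd -d <device> dump" reports something like

    Timeout (msgwait)  : 20

then stonith-timeout needs to be at least half as large again, e.g.

    crm configure property stonith-timeout=30s
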
>
> >       To make it work for me (Ubuntu 12.04), I had to create the
> > /etc/sysconfig/sbd file with:
> >
> > SBD_DEVICE="/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1"
> > SBD_OPTS="-W"
> >
> >       and the resource configuration is
> >
> > primitive stonith_sbd stonith:external/sbd \
> >         params sbd_device="/dev/disk/by-id/wwn-0x6006016009702500a4227a04c6b0e211-part1" \
> >         meta target-role="Started"
>
> In the newer versions, it is not necessary to have the "params" on the
> primitive anymore - it'll read the /etc/sysconfig/sbd file. Overriding
> that shouldn't really be necessary.
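
A minimal configuration along those lines (assuming SBD_DEVICE is set in
/etc/sysconfig/sbd on every node) would then simply be:

    primitive stonith_sbd stonith:external/sbd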
>
> I can assure you that sbd fencing is working fine in SLE HA 11 SP3, or
> my lab cluster would never complete a single fence successfully ;-)
>
>
> Regards,
>     Lars
>
> --
> Architect Storage/HA
> SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix
> Imendörffer, HRB 21284 (AG Nürnberg)
> "Experience is the name everyone gives to their mistakes." -- Oscar Wilde
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



-- 
Best regards,
Jan Christian

