[Pacemaker] trying to set up sbd stonith

Lars Marowsky-Bree lmb at suse.de
Wed Feb 24 21:31:22 UTC 2010


On 2010-02-24T20:53:40, Sander van Vugt <mail at sandervanvugt.nl> wrote:

> Hi,
> 
> STONITH seems to be driving me to feel like SMITH lately, so after
> unsuccessful attempts to get drac5 and rackpdu to do their work, I'm now
> focusing on the external/sbd plugin. It doesn't work too well though, so
> if anyone can give me a hint, I would appreciate. 

You are missing the configuration in /etc/sysconfig/sbd, it seems.

Here are some notes that will be merged into the SP1 manual - they're my
rough notes, not yet refined by the documentation group:


> Which looks like they are trying to do a STONITH shootout? So I cleared
> that information, using sbd -d /dev/dm-0 <nodenames> clear which looks
> useful, but didn't fix the issue. (Neither did a bold sbd -d /dev/dm-0
> create).

Don't use /dev/dm-0 links; they are not stable.
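
For example, the links under /dev/disk/by-id are persistent across
reboots. List them with:

# ls -l /dev/disk/by-id/

and pick the entry that points at your sbd partition, i.e. something of
the form /dev/disk/by-id/scsi-<WWID>-part1 (the exact name depends on
your storage).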



= Storage protection =

The SLE HA cluster stack's highest priority is protecting the integrity
of data. This is achieved by preventing uncoordinated concurrent access
to data storage - such as mounting an ext3 file system more than once in
the cluster, but also preventing OCFS2 from being mounted if
coordination with other cluster nodes is not available. In a
well-functioning cluster, Pacemaker will detect if resources are active
beyond their concurrency limits and initiate recovery; further, its
policy engine will never exceed these limitations.

However, network partitioning or software malfunction could potentially
cause scenarios where several coordinators are elected. If this
so-called split brain scenario were allowed to unfold, data corruption
might occur. Hence, several layers of protection have been added to the
cluster stack to mitigate this.

IO fencing/STONITH is the primary component contributing to this goal,
since it ensures that, prior to storage activation, all other access is
terminated; cLVM2 exclusive activation or OCFS2 file locking support are
other mechanisms, protecting against administrative or application
faults. Combined appropriately for your setup, these can reliably
prevent split-brain scenarios from causing harm.

This chapter describes an IO fencing mechanism that leverages the
storage itself, followed by a description of an additional layer of
protection to ensure exclusive storage access. These two mechanisms can
even be combined for higher levels of protection.


== Storage-based fencing ==

This section describes how clusters that use shared storage can
leverage that storage for very reliable I/O fencing and avoidance of
split-brain scenarios.

This mechanism has been used successfully with the Novell Cluster Suite
and is also available in a similar fashion for the SLE HA 11 product
using the "external/sbd" STONITH agent.

=== Description ===

In an environment where all nodes have access to shared storage, a small
(1MB) partition is formatted for use with sbd. The daemon, once
configured, is brought online on each node before the rest of the
cluster stack is started, and terminated only after all other cluster
components have been shut down - ensuring that cluster resources are
never activated without sbd supervision.

The daemon automatically allocates one of the message slots on the
partition to itself, and constantly monitors it for messages to itself.
Upon receipt of a message, the daemon immediately complies with the
request, such as initiating a power-off or reboot cycle for fencing.

The daemon also constantly monitors connectivity to the storage device,
and commits suicide if the partition becomes unreachable, guaranteeing
that it is never cut off from fencing messages. (If the
cluster data resides on the same logical unit in a different partition,
this is not an additional point of failure; the work-load would
terminate anyway if the storage connectivity was lost.)

Increased protection is offered through "watchdog" support. Modern
systems support a "hardware watchdog" that has to be updated by the
software client, or else the hardware will enforce a system restart.
This protects against failures of the sbd process itself, such as
dying or becoming stuck on an IO error.

=== Setup guide ===

==== Requirements ====

The environment must have shared storage reachable by all nodes. It is
recommended to create a 1MB partition at the start of the device. In the
rest of this text, this partition is referred to as "/dev/SBD"; please
substitute your actual pathname (e.g., "/dev/sdc1") for it below.

This shared storage segment must not make use of host-based RAID, cLVM2,
or DRBD.

However, using storage-based RAID and multipathing is recommended for
increased reliability.
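
If you are unsure whether the device is already under MPIO control, a
quick check (assuming multipath-tools is installed) is:

# multipath -ll

which lists the multipath maps and their paths; the corresponding
/dev/mapper entries then give you stable names to use for the sbd
device.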

==== SBD partition ====

All these steps must be performed as root.

After having made very sure that this is indeed the device you want to
use and that it does not hold any data you need - the sbd command will
overwrite it without further requests for confirmation - initialize the
sbd device:

# sbd -d /dev/SBD create

This will write a header to the device, and create slots for up to 255
nodes sharing this device with default timings.

If your sbd device resides on a multipath group, you may need to adjust
the timeouts sbd uses, as MPIO's path down detection can cause some
latency: after the msgwait timeout, the message is assumed to have been
delivered to the node. For multipath, this should be the time required
for MPIO to detect a path failure and switch to the next path. You may
have to test this in your environment. The node will perform suicide if
it has not updated the watchdog timer fast enough; the watchdog timeout
must be shorter than the msgwait timeout - half the value is a good
estimate. This can be specified when the SBD device is initialized:

# /usr/sbin/sbd -d /dev/SBD -4 $msgwait -1 $watchdogtimeout create

(All timeouts are in seconds.)
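
As a purely illustrative example, assume MPIO in your environment needs
up to 60 seconds to detect a path failure and switch to the next path;
the timeouts could then be set like this (the numbers are assumptions,
substitute your own measurements):

# /usr/sbin/sbd -d /dev/SBD -4 60 -1 30 create

i.e. a msgwait timeout of 60 seconds and a watchdog timeout of 30
seconds (half of msgwait).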

You can look at what was written to the device using:

# sbd -d /dev/SBD dump 
Header version     : 2
Number of slots    : 255
Sector size        : 512
Timeout (watchdog) : 5
Timeout (allocate) : 2
Timeout (loop)     : 1
Timeout (msgwait)  : 10

As you can see, the timeouts are also stored in the header, to ensure
that all participating nodes agree on them.

==== Setup the software watchdog ====

Additionally, it is highly recommended that you set up your Linux system
to use a watchdog. Please refer to the SLES manual for this step(?).

This involves loading the proper watchdog driver on system boot. On HP
hardware, this is the "hpwdt" module. For systems with an Intel TCO,
"iTCO_wdt" can be used. "softdog" is the most generic driver, but it is
recommended that you use one with actual hardware integration. (See
"drivers/watchdog" in the kernel package for a list of choices.)

==== Starting the sbd daemon ====

The sbd daemon is a critical piece of the cluster stack. It must always
be running while the cluster stack is up, and even when the rest of the
stack has crashed, so that the node can still be fenced.

The openais init script starts and stops SBD if configured; add the
following to /etc/sysconfig/sbd:

===
SBD_DEVICE="/dev/SBD"
# The next line enables the watchdog support:
SBD_OPTS="-W"
=== 

If the SBD device is not accessible, the daemon will fail to start and
inhibit openais startup.

Note: If the SBD device becomes inaccessible from a node, this could
cause the node to enter an infinite reboot cycle. That is technically
correct behaviour, but depending on your administrative policies, it
might be considered a nuisance. In such cases, you may prefer not to
start openais automatically on boot.
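
One way to do that on SLES (assuming the standard init scripts) is to
disable the automatic start and bring the stack up manually once the
node has been checked:

# chkconfig openais off
# rcopenais start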

Before proceeding, ensure that SBD has indeed started on all nodes by
restarting the cluster stack with "rcopenais restart".

=== Testing SBD ===

The command

# sbd -d /dev/SBD list

will dump the node slots, and their current messages, from the sbd
device. You should see all cluster nodes that have ever been started
with sbd listed there, most likely with the message slot showing
"clear".

You can now try sending a test message to one of the nodes:

# sbd -d /dev/SBD message nodea test

The node will acknowledge the receipt of the message in the system logs:

Aug 29 14:10:00 nodea sbd: [13412]: info: Received command test from nodeb

This confirms that SBD is indeed up and running on the node, and that it
is ready to receive messages.
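
Once this works, you can optionally verify the complete fencing path by
sending a real fencing message to a node you can afford to lose - note
that this will reboot the target node immediately:

# sbd -d /dev/SBD message nodea reset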


==== Configuring the fencing resource ====

To complete the sbd setup, it is necessary to activate sbd as a
STONITH/fencing mechanism in the CIB as follows (the primitive needs a
resource ID; "sbd_stonith" below is merely an example name):

# crm
configure
property stonith-enabled="true"
property stonith-timeout="30s"
primitive sbd_stonith stonith:external/sbd params sbd_device="/dev/SBD"
commit
quit

Note that since node slots are allocated automatically, no manual
hostlist needs to be defined.

The SBD mechanism is intended to replace other fencing/stonith
mechanisms; please disable any others you might have configured
previously.

Once the resource has started, your cluster is successfully configured
for shared-storage fencing, and will use this method whenever a node
needs to be fenced.
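
To verify that the fencing resource came up, check the cluster status;
the external/sbd primitive should be reported as Started on one of the
nodes (crm_mon is part of Pacemaker):

# crm_mon -1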

[snip]


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde




