[Pacemaker] Questions regarding STONITH

Digimer lists at alteeve.ca
Wed Jun 26 11:48:13 EDT 2013


You have a few issues here.

First up, a node can *never* be expected to self-fence, because a node
that has crashed or hung can't run anything, including its own fence
call. You can intentionally crash a node to see why this is a bad idea;
'echo c > /proc/sysrq-trigger' (this will immediately crash a node!).
You need to make sure each node can reach the *other* node's iLO
interface.

Next; yes, fencing when both nodes are ok but the link is down is a
"quick-draw" scenario. The trick here is that, using IPMI/iLO/etc, there
is a possibility that each node initiates the shut down of the other
node before it gets forced off itself, causing both nodes to shut down. To
protect against this, you can set a delay (I use 15 seconds) that says
"If I need to fence node X, wait Y seconds before starting". This way,
in a network split, one node gets a 15s head start and this makes sure
that both nodes don't power off at the same time.
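
To make the head start concrete, here is a rough sketch of the race in
Python. It is a toy model, not how stonithd works internally: the
function name 'survivors', the 4 second fence latency and the rule that
a power-off request already sent still completes are all assumptions
made for illustration. The node names are the ones from Paul's mail.

```python
# Toy model of the "quick-draw" fence race in a two-node cluster.
# Assumptions (not from any real fence agent): both nodes notice the
# split at t=0, a fence call takes fence_latency seconds to power off
# its target, and a call that has been sent completes even if the
# sender is powered off afterwards.

PEER = {"node-a": "node-b", "node-b": "node-a"}


def survivors(delay_before_fencing, fence_latency=4.0):
    """Return the set of nodes still powered on after a network split.

    delay_before_fencing maps a *target* node to the number of seconds
    its peer must wait before starting to fence it.
    """
    # Time at which each attacker starts its fence call against its peer.
    start = {
        attacker: delay_before_fencing.get(target, 0.0)
        for attacker, target in PEER.items()
    }
    first, second = sorted(start, key=start.get)

    alive = {first, second}
    # The earlier call always goes out, so its target is powered off.
    alive.discard(second)
    # The later call only goes out if its sender is still powered on.
    if start[second] < start[first] + fence_latency:
        alive.discard(first)
    return alive


print(survivors({}))                # no delay: both nodes end up off
print(survivors({"node-a": 15.0}))  # node-a gets the 15s head start
```

With no delay both calls go out at the same moment and both nodes are
powered off. With the delay set on the fence method that targets node-a,
node-b is already off before it gets to make its call.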

Third; trying to build a cluster in two different locations (called a
"stretch cluster") is tricky because it's very hard to distinguish a
failed location from a broken network connection between the
locations. In either case, a failed fence action will leave the cluster
hung (if properly configured), requiring human intervention.
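
A small sketch of why the two cases look the same from the surviving
side. This is only an illustration; 'observe' and 'next_action' are
made-up names, and it assumes the peer's iLO sits at the peer's site and
can only be reached over the same inter-site link.

```python
# Toy model of what one node in a stretch cluster can actually see.
# Assumption: the peer's fence device is at the peer's site, loses power
# with that site, and is reachable only over the inter-site link.


def observe(peer_site_up, link_up):
    """Return (heartbeat_seen, fence_succeeds) as seen by the local node."""
    heartbeat_seen = peer_site_up and link_up
    fence_succeeds = peer_site_up and link_up
    return heartbeat_seen, fence_succeeds


def next_action(heartbeat_seen, fence_succeeds):
    """What a properly configured cluster does with that observation."""
    if heartbeat_seen:
        return "run normally"
    if fence_succeeds:
        return "recover the peer's resources"
    return "block until a human intervenes"


site_failed = observe(peer_site_up=False, link_up=True)
link_broken = observe(peer_site_up=True, link_up=False)

print(site_failed == link_broken)  # the two failures are indistinguishable
print(next_action(*site_failed))
print(next_action(*link_broken))
```

Both failures produce the same observation, and in both the fence call
cannot be confirmed, so the only safe thing left is to block.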

digimer

On 06/26/2013 08:10 AM, Paul Walsh wrote:
> Situation:
> 
> Two HP DL380 G7 (with ILO3) servers running pacemaker and heartbeat
> (yes, I know it's deprecated but I haven't got round to using corosync
> yet) on RHEL 6.4 (NOT using the Red Hat Cluster suite). The servers are
> in different data centres.
> 
> node-a normally hosts MySQL  database for application (/var/lib/mysql is
> on a DRBD device)
> node-b normally presents filesystems via NFS to application web front-ends
> 
> Each node is able to assume the role of the other in the event of failure
> 
> We've had occasions where network connectivity has been disrupted,
> resulting in a split-brain DRBD device.  What I want to do is configure
> STONITH so that only one node will end up running (and take over the
> resources of the other until that node is re-started).  However, the
> ILOs of each server are on a protected VLAN so each node is unable to
> access the other's ILO for the purposes of killing the power (so I can't
> use something like "ipmitool -I lanplus -U adminuser -H node-a-ilo
> -a power off" on node-b, for instance). A node can, however, power
> _itself_ off with "ipmitool power off" so I'm wondering if this is
> something stonithd can do (node-a tells stonithd on node-b to power off
> and vice-versa)?
> 
> Am I right in thinking the idea behind STONITH is a "quick draw" -
> whichever node reacts fastest manages to "kill" the other and survive?
> 
> What happens when communication is completely lost between the two - how
> does each try to shoot the other if there's no network link?
> 
> Paul
> 
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 


-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



