[Pacemaker] Query regarding configuring STONITH device

Thu Jul 5 04:23:51 EDT 2012

On Mon, Jul 02, 2012 at 10:42:54PM +0530, sachin garg wrote:
> Hi,
> 
> 
> On Mon, Jul 02, 2012 at 05:49:38PM +0530, sachin garg wrote:
> > Hi,
> >
> > I am using the IPMI plugin to configure STONITH with a heartbeat cluster.
> > If a resource fails on one node then the other node STONITHs that node. But
> > when the failed node comes back after the reboot, the STONITH device itself
> > fails on the node which has started again. Logs indicate that the IPMI start
> > operation returned 1 (i.e. unknown error).
> 
> >> Isn't there more in the logs, i.e. a specific reason?
> No, just a one-liner is present in the traces. I went through the IPMI script
> to understand in which scenario it may return 1. There is just one flow
> (see below), which indicates that execution of ipmitool fails at start.
> But this doesn't happen if I start heartbeat manually; it only happens upon
> reboot (I have a strict requirement to start the heartbeat stack upon restart).
> 
> # Yet another convenience wrapper that invokes run_ipmitool, captures
> # its output, logs the output, returns either 0 (on success) or 1 (on
> # any error)
> do_ipmi() {
>     if outp=`run_ipmitool $*`; then
>         ha_log.sh debug "ipmitool output: `echo $outp`"
>         return 0
>     else
>         ha_log.sh err "error executing ipmitool: `echo $outp`"
>         return 1
>     fi
> }
> 
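If the failure really is a boot-time race (network or BMC not yet reachable when the stack starts), one way to test that theory would be to retry a reachability check a few times before starting heartbeat. The following is only a minimal sketch under that assumption: retry_cmd and its arguments are hypothetical names for illustration and are not part of the external/ipmi plugin.

```shell
#!/bin/sh
# retry_cmd ATTEMPTS DELAY CMD [ARGS...]
# Hypothetical helper, not part of any stonith plugin.
# Runs CMD until it succeeds or ATTEMPTS runs have failed, sleeping DELAY
# seconds between tries. Returns 0 on success, 1 if every attempt failed.
retry_cmd() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        # "$@" preserves the command's arguments exactly as passed.
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}
```

It could be called from whatever starts the stack at boot, e.g. retry_cmd 5 2 ping -c 1 "$bmc_addr", where bmc_addr is a placeholder for the address of the IPMI device. If that check needs several attempts after a reboot but succeeds on the first try later on, the initialization-delay theory becomes more plausible.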
> 
> > I suspect that this may be due
> > to some initialization delays at the network level, but I am not sure about
> > this. What would be the best way to overcome this issue? I am considering
> > adding a start delay to the stonith device but can't say if that is the
> > right approach.
> 
> >>Happens only once after boot? Afterwards works fine? Strange.
> >>Well, it's arguably good practice not to start the cluster stack
> >>automatically on boot.
> I have a strict requirement to start the heartbeat stack upon restart. Would
> adding a start delay help? I have reasons to believe that it
> doesn't.
> 
> 
> > Moreover, how should one configure start/monitor operation failure for a
> > STONITH device? I have currently configured pacemaker to fence the node if
> > start/monitor operation fails for STONITH device. Is this the right
> > configuration?
> 
> >> No. Nothing special needs to be configured.
> Let me rephrase my question: All my resources have been configured for
> fencing upon monitor failure.

Sounds extreme.

> So, should I configure fencing or restart for the
> STONITH device? I ask because the fencing action is carried out by the
> STONITH device itself. Moreover, if I configure "fence" for stonith
> device start failure, I get one extra reboot but eventually the system
> recovers and there are no more failures.

That's overkill. Note that the fencing device is not a SPOF.

> > And what should be the monitoring frequency for STONITH device?
> >>Take a look here http://clusterlabs.org/doc/crm_fencing.html
> Thanks for pointing me to the article. The article says that monitoring must
> happen only 2-3 times per hour.

I think it's more like once every 2-3 hours.
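
For illustration, a fencing resource monitored on that kind of schedule might be configured roughly as below. This is a hedged sketch, not a tested configuration: the helper name stonith_primitive, the resource ids, the USER/PASS placeholders, the 120m interval and the 60s timeout are all assumptions made for the example. No on-fail setting is given, so the default recovery applies to the fencing resource itself.

```shell
#!/bin/sh
# stonith_primitive NODE BMC_ADDR
# Hypothetical helper: prints a crm configure fragment for one
# external/ipmi fencing resource that fences NODE via the BMC at BMC_ADDR.
# USER and PASS are placeholders to be replaced with real credentials.
stonith_primitive() {
    node=$1
    bmc_addr=$2
    # In an unquoted here-document, "\\" is emitted as a single backslash,
    # which crm reads as a line continuation.
    cat <<EOF
primitive st-$node stonith:external/ipmi \\
    params hostname=$node ipaddr=$bmc_addr userid=USER passwd=PASS interface=lan \\
    op monitor interval=120m timeout=60s
location l-st-$node st-$node -inf: $node
EOF
}
```

The location constraint keeps the resource off the node it is meant to fence. The output can be saved to a file and loaded with crm configure load update FILE, one call per node.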

> But I have an SRS with the customer
> which says that any required failover must happen within 30 seconds. So, in an
> extreme scenario where the fencing device itself fails, I won't be able to
> fulfill the terms of the SRS. Please advise.

Fencing takes place once a fault has been detected. If the
stonith-based reset fails too, then it's a second failure and the
cluster cannot protect against that.

Note that fencing, though indispensable, is usually needed only
rarely. If your cluster is different, then you should perhaps
reconsider the design.

Thanks,

Dejan

> Thanks,
> 
> Dejan
> 
> > Regards
> 