[Pacemaker] abrupt power failure problem

Tue Jun 15 15:15:08 EDT 2010

I've recently had exactly the same thing happen. One (highly kludgey!)
solution I've considered is hacking a custom version of the stonith IPMI
agent that, following a stonith failure, would check whether the node
was reachable at all via any of the cluster interfaces reported by
cl_status (I have redundant network links), and would return true (i.e.
pretend the stonith succeeded) if it was not. Since this is basically
the logic I would use if I were debugging the issue remotely, I don't
see that this would be any worse.
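
To make the idea concrete, here is a minimal sketch of just the decision
logic, written in Python for illustration rather than as a patch to the
real agent. Everything named here is hypothetical: fence_result,
ping_once, and the idea of passing in the peer's addresses as a list
(in practice they would have to be derived from what cl_status reports,
which this sketch does not attempt). The ping flags assume Linux
iputils.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(addr: str, timeout_s: int = 2) -> bool:
    """Return True if addr answers one ICMP echo (assumes Linux iputils ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), addr],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def fence_result(
    ipmi_reset_ok: bool,
    peer_addrs: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Decide what the hacked agent would report to the cluster.

    True means "report the stonith as successful".
    """
    if ipmi_reset_ok:
        # The real IPMI reset worked; nothing to fake.
        return True

    addrs = list(peer_addrs)
    if not addrs:
        # No interfaces to check: never pretend success blindly.
        return False

    # Pretend success only if the peer is silent on EVERY interface.
    # If even one link answers, the node may still be alive.
    return not any(probe(addr) for addr in addrs)
```

For example, fence_result(False, ["10.0.0.3", "10.1.0.3"]) would report
success only if neither (made-up) address answers a ping. Note that
this cannot tell "powered off" apart from "alive but cut off on all
links", which is exactly the split-brain risk.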

Besides the obvious (potential, however unlikely, for split-brain), is
there any reason this approach wouldn't work?

On Tue, 15 Jun 2010 19:55:49 +0200
Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:

> Hello Diane,
> 
> The problem is that pacemaker is not allowed to take over resources
> until stonith succeeds, as it simply does not know the state of the
> other server. Let's assume the other node were still up and running,
> had mounted a shared storage device and were writing to it, but no
> longer responded on the network. If pacemaker now mounted this device
> again, you would get data corruption. To protect you against that, it
> requires that stonith succeeds, or that you resolve the problem
> manually.
> 
> The only automatic solution would be a more reliable stonith device,
> e.g. IPMI with an extra power supply for the IPMI card or a PDU.
> 
> Cheers,
> Bernd
> 
> On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > Thanks for the idea. Is there any way to automatically recover
> > resources without manual intervention?
> > 
> > Diane
> > 
> > 
> > -----Original Message-----
> > From: Bernd Schubert [mailto:bs_lists at aakef.fastmail.fm]
> > Sent: Tuesday, June 15, 2010 1:39 PM
> > To: pacemaker at oss.clusterlabs.org
> > Cc: Schaefer, Diane E
> > Subject: Re: [Pacemaker] abrupt power failure problem
> > 
> > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > Hi,
> > > We are having trouble with our two-node cluster after one node
> > > experiences an abrupt power failure. The resources do not seem
> > > to start on the remaining node (i.e. DRBD resources do not promote
> > > to master). In the log we notice:
> > >
> > > Jan  8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > Jan  8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > >
> > > Jan  8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > Jan  8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > Jan  8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > 
> > Without looking at your hb_report, this already looks pretty clear
> > - this node tries to reset the other node using IPMI and that
> > fails, of course, as the node to be reset is powered off.
> > When we had that problem in the past, we simply removed the failed
> > node from the pacemaker configuration temporarily:
> > 
> >   crm node remove <node-name>
> > 
> > 
> > Cheers,
> > Bernd
> > 

-- 
Dan Urist
durist at ucar.edu
303-497-2459