[Pacemaker] abrupt power failure problem

Tue Jun 15 20:08:37 UTC 2010

Hi,

On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> I've recently had exactly the same thing happen. One (highly kludgey!)
> solution I've considered is hacking a custom version of the stonith IPMI
> agent that would check whether the node was at all reachable following a
> stonith failure via any of the cluster interfaces reported by
> cl_status (I have redundant network links), and then return true (i.e.
> pretend the stonith succeeded) if it isn't. Since this is basically the
> logic I would use if I were trying to debug the issue remotely, I don't
> see that this would be any worse. 
> 
> Besides the obvious (potential, however unlikely, for split-brain), is
> there any reason this approach wouldn't work?

That sounds like a good enough reason to me :) If you can't reach
the host, you cannot know its state.
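
To make that concrete, here is a rough Python sketch of the decision
involved. It is only an illustration, not code from any stonith agent:
the names classify, safe_to_take_over and NodeState are made up for
this example, and the inputs stand in for whatever the power device
and the link probes would report.

```python
from enum import Enum


class NodeState(Enum):
    CONFIRMED_OFF = "confirmed_off"
    CONFIRMED_ON = "confirmed_on"
    UNKNOWN = "unknown"


def classify(power_status, reachable_links):
    """Classify a peer node from what we can actually observe.

    power_status: "off", "on", or None when the power device itself
                  did not answer (hypothetical input for this sketch).
    reachable_links: one boolean per cluster link that was probed.
    """
    if power_status == "off":
        return NodeState.CONFIRMED_OFF
    if power_status == "on" or any(reachable_links):
        return NodeState.CONFIRMED_ON
    # No answer from the power device and no link responds:
    # the node may be dead, or alive and still writing to shared storage.
    return NodeState.UNKNOWN


def safe_to_take_over(state):
    # Only a positive confirmation that the node is off is good enough.
    return state is NodeState.CONFIRMED_OFF


# A node that lost power (IPMI card dead too) and a node that is hung
# on the network but still writing produce the same observations:
lost_power = classify(None, [False, False])
hung_but_writing = classify(None, [False, False])
print(lost_power, hung_but_writing, safe_to_take_over(lost_power))
```

Both cases end up as UNKNOWN, so an agent that returns success here
is guessing, which is exactly the split-brain risk mentioned above.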

Thanks,

Dejan

> On Tue, 15 Jun 2010 19:55:49 +0200
> Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:
> 
> > Hello Diane,
> > 
> > > the problem is that pacemaker is not allowed to take over resources
> > > until stonith succeeds, as it simply does not know the state of
> > > the other server. Let's assume the other node were still up and
> > > running, had mounted a shared storage device and were writing to
> > > it, but no longer responded on the network. If pacemaker now
> > > mounted this device on the surviving node, you would get data
> > > corruption. To protect you against that, it requires that stonith
> > > succeeds, or that you manually solve that problem.
> > 
> > The only automatic solution would be a more reliable stonith device,
> > e.g. IPMI with an extra power supply for the IPMI card or a PDU.
> > 
> > Cheers,
> > Bernd
> > 
> > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > Thanks for the idea. Is there any way to automatically recover
> > > resources without manual intervention?
> > > 
> > > Diane
> > > 
> > > 
> > > 
> > > -----Original Message-----
> > > From: Bernd Schubert [mailto:bs_lists at aakef.fastmail.fm]
> > > Sent: Tuesday, June 15, 2010 1:39 PM
> > > To: pacemaker at oss.clusterlabs.org
> > > Cc: Schaefer, Diane E
> > > Subject: Re: [Pacemaker] abrupt power failure problem
> > > 
> > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > Hi,
> > > > >   We are having trouble with our two-node cluster after one node
> > > > > experiences an abrupt power failure.  The resources do not seem
> > > > > to start on the remaining node (i.e. DRBD resources do not promote
> > > > > to master).  In the log we notice:
> > > >
> > > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > > >
> > > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > > Jan  8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > > 
> > > Without looking at your hb_report, this already looks pretty clear
> > > - this node tries to reset the other node using IPMI and that
> > > fails, of course, as the node to be reset is powered off.
> > > > When we had that problem in the past, we simply temporarily removed
> > > > the failed node from the pacemaker configuration:
> > > >
> > > >   crm node remove <node-name>
> > > 
> > > 
> > > Cheers,
> > > Bernd
> > > 
> > 
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> 
> -- 
> Dan Urist
> durist at ucar.edu
> 303-497-2459
> 