[Pacemaker] abrupt power failure problem

Tue Jun 15 21:08:24 UTC 2010

Hi,

On Tue, Jun 15, 2010 at 02:25:51PM -0600, Dan Urist wrote:
> On Tue, 15 Jun 2010 22:08:37 +0200
> Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> 
> > Hi,
> > 
> > On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> > > I've recently had exactly the same thing happen. One (highly
> > > kludgey!) solution I've considered is hacking a custom version of
> > > the stonith IPMI agent that would check whether the node was at all
> > > reachable following a stonith failure via any of the cluster
> > > interfaces reported by cl_status (I have redundant network links),
> > > and then return true (i.e. pretend the stonith succeeded) if it
> > > isn't. Since this is basically the logic I would use if I were
> > > trying to debug the issue remotely, I don't see that this would be
> > > any worse. 
> > > 
> > > Besides the obvious (potential, however unlikely, for split-brain),
> > > is there any reason this approach wouldn't work?
> > 
> > That sounds like a reason good enough to me :) If you can't reach
> > the host, you cannot know its state.
> >
> 
> But in my case, if the live node can't reach the suspect node via its
> public network interface, its private bonded interface, or its IPMI
> card (I've added a ping test for that, to determine that it's actually
> unreachable rather than just failing), it seems pretty reasonable for
> me to assume it's really dead at that point. 

Perhaps somebody just pulled the network cables. I understand
that it's not unheard of.
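
To make the distinction concrete, here is a minimal sketch in Python. It is not code from the stonith IPMI agent or from pacemaker; the function names, the state names and the path names ("public", "private", "ipmi") are made up for illustration. It only encodes the rule discussed in this thread: a node counts as fenced when the fencing device confirms it, and "unreachable on every path" is a separate, unknown state that must not be treated as success.

```python
from enum import Enum


class NodeState(Enum):
    FENCED = "fenced"      # fencing device confirmed the reset/off
    ALIVE = "alive"        # at least one path still answers
    UNKNOWN = "unknown"    # nothing answers, nothing confirmed


def classify_node(fence_confirmed, reachable):
    """Classify a suspect node.

    fence_confirmed: True only if the fencing device reported success.
    reachable: dict mapping a (hypothetical) path name to whether a
               ping over that path got a reply.
    """
    if fence_confirmed:
        return NodeState.FENCED
    if any(reachable.values()):
        return NodeState.ALIVE
    # All paths silent: a power loss and pulled cables look identical.
    return NodeState.UNKNOWN


def may_take_over(state, operator_confirmed=False):
    """Resources may move only after fencing or a manual confirmation."""
    return state is NodeState.FENCED or operator_confirmed


if __name__ == "__main__":
    silent = {"public": False, "private": False, "ipmi": False}
    state = classify_node(fence_confirmed=False, reachable=silent)
    print(state.value, may_take_over(state))
```

The proposed hack amounts to mapping UNKNOWN to FENCED. Keeping them apart means the takeover waits for either a working stonith device or an operator who has checked the node.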

Thanks,

Dejan

> > Thanks,
> > 
> > Dejan
> > 
> > > On Tue, 15 Jun 2010 19:55:49 +0200
> > > Bernd Schubert <bs_lists at aakef.fastmail.fm> wrote:
> > > 
> > > > Hello Diane,
> > > > 
> > > > > the problem is that pacemaker is not allowed to take over
> > > > > resources until stonith succeeds, as it simply does not know
> > > > > the state of the other server. Let's assume the other node
> > > > > were still up and running, had a shared storage device
> > > > > mounted and were writing to it, but no longer responded on
> > > > > the network. If pacemaker now mounted this device again, you
> > > > > would get data corruption. To protect you against that, it
> > > > > requires that stonith succeeds, or that you resolve the
> > > > > problem manually.
> > > > 
> > > > The only automatic solution would be a more reliable stonith
> > > > device, e.g. IPMI with an extra power supply for the IPMI card or
> > > > a PDU.
> > > > 
> > > > Cheers,
> > > > Bernd
> > > > 
> > > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > > Thanks for the idea. Is there any way to automatically recover
> > > > > resources without manual intervention?
> > > > > 
> > > > > Diane
> > > > > 
> > > > > 
> > > > > -----Original Message-----
> > > > > From: Bernd Schubert [mailto:bs_lists at aakef.fastmail.fm]
> > > > > Sent: Tuesday, June 15, 2010 1:39 PM
> > > > > To: pacemaker at oss.clusterlabs.org
> > > > > Cc: Schaefer, Diane E
> > > > > Subject: Re: [Pacemaker] abrupt power failure problem
> > > > > 
> > > > > On Tuesday 15 June 2010, Schaefer, Diane E wrote:
> > > > > > Hi,
> > > > > > >   We are having trouble with our two-node cluster after one
> > > > > > > node experiences an abrupt power failure.  The resources do
> > > > > > > not seem to start on the remaining node (i.e. DRBD resources
> > > > > > > are not promoted to master).  In the log we notice:
> > > > > >
> > > > > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > > > > Jan  8 02:12:27 qpr4 stonithd: [6622]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > > > > Jan  8 02:12:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > > > > > >
> > > > > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: info: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/ipmi reset qpr3' returned 256
> > > > > > > Jan  8 02:13:27 qpr4 stonithd: [6763]: CRIT: external_reset_req: 'ipmi reset' for host qpr3 failed with rc 256
> > > > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: failed to STONITH node qpr3 with local device stonith0 (exitcode 5), gonna try the next local device
> > > > > > > Jan  8 02:13:27 qpr4 stonithd: [5854]: info: we can't manage qpr3, broadcast request to other nodes
> > > > > > > Jan  8 02:14:27 qpr4 stonithd: [5854]: ERROR: Failed to STONITH the node qpr3: optype=RESET, op_result=TIMEOUT
> > > > > 
> > > > > > Without looking at your hb_report, this already looks pretty
> > > > > > clear: this node tries to reset the other node using IPMI and
> > > > > > that fails, of course, as the node to be reset is powered off.
> > > > > > When we had that problem in the past, we simply temporarily
> > > > > > removed the failed node from the pacemaker configuration:
> > > > > > 
> > > > > >   crm node remove <node-name>
> > > > > 
> > > > > 
> > > > > Cheers,
> > > > > Bernd
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > _______________________________________________
> > > > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > > > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > > > 
> > > > Project Home: http://www.clusterlabs.org
> > > > > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > > > > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> > > 
> > > 
> > > 
> > > -- 
> > > Dan Urist
> > > durist at ucar.edu
> > > 303-497-2459
> > > 
> > 
> 
> 
> 
> -- 
> Dan Urist
> durist at ucar.edu
> 303-497-2459
> 