[Pacemaker] abrupt power failure problem

Bernd Schubert bs_lists at aakef.fastmail.fm
Tue Jun 15 18:41:30 EDT 2010


On Tuesday 15 June 2010, Dejan Muhamedagic wrote:
> Hi,
> 
> On Tue, Jun 15, 2010 at 02:25:51PM -0600, Dan Urist wrote:
> > On Tue, 15 Jun 2010 22:08:37 +0200
> >
> > Dejan Muhamedagic <dejanmm at fastmail.fm> wrote:
> > > Hi,
> > >
> > > On Tue, Jun 15, 2010 at 01:15:08PM -0600, Dan Urist wrote:
> > > > I've recently had exactly the same thing happen. One (highly
> > > > kludgey!) solution I've considered is hacking a custom version of
> > > > the stonith IPMI agent that would check whether the node was at all
> > > > reachable following a stonith failure via any of the cluster
> > > > interfaces reported by cl_status (I have redundant network links),
> > > > and then return true (i.e. pretend the stonith succeeded) if it
> > > > isn't. Since this is basically the logic I would use if I were
> > > > trying to debug the issue remotely, I don't see that this would be
> > > > any worse.
> > > >
> > > > Besides the obvious (potential, however unlikely, for split-brain),
> > > > is there any reason this approach wouldn't work?
> > >
> > > That sounds like a reason good enough to me :) If you can't reach
> > > the host, you cannot know its state.
> >
> > But in my case, if the live node can't reach the suspect node via its
> > public network interface, its private bonded interface, or its IPMI
> > card (I've added a ping test for that, to determine that it's actually
> > unreachable rather than just failing), it seems pretty reasonable for
> > me to assume it's really dead at that point.
> 
> Perhaps somebody just pulled the network cables. I understand
> that it's not unheard of.


The network driver may also have crashed. And if it's a shared-NIC IPMI (*), the 
network driver may also have brought down IPMI. 
Of course, I also see the problem of a complete server failure and the need to 
automatically recover from it. Besides a better stonith device, the only 
solution I see would be a new parameter to make pacemaker assume a node is 
dead if not a single network access succeeds, even though stonith failed. Of 
course that should default to off and probably should only be possible to 
enable by adding something like 
"really_enable_parameter = I-know-exactly-what-I-do-and-accept-possible-split-
brain-and-data-corruption".
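To illustrate the logic such a parameter (or a custom stonith wrapper) would implement, here is a minimal Python sketch. It is purely hypothetical, not pacemaker code: the address list, the `host_reachable` helper, and the ping flags are assumptions; the point is only that the node is presumed dead when *every* known path to it fails.

```python
import subprocess

def host_reachable(addr, timeout=2):
    """Hypothetical helper: one ICMP ping to addr; True if it answers.
    Uses the common Linux iputils flags -c (count) and -W (timeout)."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout), addr],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def assume_node_dead(addresses, probe=host_reachable):
    """Assume the node is dead ONLY if none of its known addresses
    (public NIC, private/bonded link, IPMI card) respond.
    A single reachable address means we cannot know its state,
    so we must NOT pretend the stonith succeeded."""
    return not any(probe(a) for a in addresses)
```

Even with all probes failing, this is exactly the split-brain gamble discussed above: unplugged cables or a crashed network driver produce the same observation as a dead server.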
For example, with Lustre's multiple-mount protection, split brain *shouldn't* be 
a problem, although I never trust a single feature alone ;)

Cheers,
Bernd

PS: (*) Sales managers who buy those IPMI-shared-NIC solutions, and the people 
from companies who sell them, should be punished: made to work rotating 24-hour 
shifts in server rooms and take over the IPMI reset part ;) 





More information about the Pacemaker mailing list