[Pacemaker] pacemaker/heartbeat fails to stop - waiting for stonith?

Thu Oct 16 10:31:09 UTC 2008

Hi,

On Thu, Oct 16, 2008 at 11:45:07AM +0200, Raoul Bhatia [IPAX] wrote:
> hi,
> 
> i wanted to stop heartbeat to install a new cib.xml.
> i (hopefully) killed (!) all relevant processes on node2 (wc02)
> and issued "/etc/init.d/heartbeat stop" on wc01.
> 
> all resources stopped fine. all, with the exception of "stonith":
> > Clone Set: DoFencing
> >     stonith_rackpdu:0   (stonith:external/rackpdu):     Started wc01
> >     stonith_rackpdu:1   (stonith:external/rackpdu):     Stopped 
> 
> 
> looking into the logfile, i find
> > Oct 16 11:40:53 wc01 pengine: [4617]: WARN: process_pe_message: Transition 359: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/heartbeat/pengine/pe-warn-388.bz2
> > Oct 16 11:40:53 wc01 pengine: [4617]: info: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
> 
> and running crm_verify -L on the pe-warn files, i get
> > crm_verify[29970]: 2008/10/16_11:41:13 notice: StopRsc:   wc01  Stop stonith_rackpdu:0
> > crm_verify[29970]: 2008/10/16_11:41:13 WARN: stage6: Scheduling Node wc02 for STONITH

I don't see why the cluster wants to shoot wc02.
The logs should say.
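
To narrow down the search in the logs, something like the following
could pull out the lines that mention fencing for a given node. It is
only a sketch: the function name fencing_lines and the keyword list are
my own assumptions, not anything shipped with Pacemaker or heartbeat,
and the sample lines are the ones quoted above.

```python
import re

# Keywords are a guess at what fencing-related messages contain;
# adjust them to match what your logs actually say.
KEYWORDS = re.compile(r"stonith|fenc|shoot", re.IGNORECASE)


def fencing_lines(lines, node):
    """Return the log lines that mention both `node` and a fencing keyword."""
    node_re = re.compile(r"\b" + re.escape(node) + r"\b")
    return [ln.rstrip("\n") for ln in lines
            if node_re.search(ln) and KEYWORDS.search(ln)]


if __name__ == "__main__":
    sample = [
        "crm_verify[29970]: 2008/10/16_11:41:13 notice: StopRsc:   wc01  Stop stonith_rackpdu:0",
        "crm_verify[29970]: 2008/10/16_11:41:13 WARN: stage6: Scheduling Node wc02 for STONITH",
    ]
    for line in fencing_lines(sample, "wc02"):
        print(line)
```

Run against the full log rather than the two sample lines, the lines
just before the first match are the ones most likely to explain the
decision.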

> i recently changed my stonith configuration so the currently active
> one is not working anymore. that's one reason i want to update my
> cib.xml.
> 
> in particular, stonith_rackpdu's hostlist must be updated to reflect
> the change from "wc0X" to "wc0X-neu".

Right. In general, if the environment changes in such a way that
some resources may fail, then one should put those resources into
unmanaged mode beforehand. Actually, it'd be best to first stop
those resources, do whatever changes you have to do, reconfigure
resources appropriately, then start them again.
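
For the hostlist change in particular, the "reconfigure" step could be
prepared on a copy of the cib.xml while the resource is stopped. The
sketch below is an illustration under assumptions: the helper name
rename_hostlist and the XML fragment with its ids are made up, and it
assumes the hostlist value is a whitespace-separated list of node names
stored in an nvpair element.

```python
import xml.etree.ElementTree as ET


def rename_hostlist(cib_xml, mapping):
    """Rewrite every nvpair named "hostlist", renaming hosts per `mapping`.

    Hosts not present in `mapping` are left unchanged.
    """
    root = ET.fromstring(cib_xml)
    for nv in root.iter("nvpair"):
        if nv.get("name") == "hostlist":
            hosts = nv.get("value", "").split()
            nv.set("value", " ".join(mapping.get(h, h) for h in hosts))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical fragment; the ids are invented for this example.
    fragment = (
        '<instance_attributes id="rackpdu-ia">'
        '<nvpair id="rackpdu-hostlist" name="hostlist" value="wc01 wc02"/>'
        '</instance_attributes>'
    )
    print(rename_hostlist(fragment, {"wc01": "wc01-neu", "wc02": "wc02-neu"}))
```

The result still has to be checked and loaded with the usual cluster
tools; the script only edits text.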

> anyways, is there any reason to avoid a shutdown and wait for stonith
> to succeed?

It depends, I guess. You can take a look at the pengine
transitions.
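
The pengine log line quoted earlier names the file holding the input for
that transition. As a rough sketch, the path can be pulled out of the log
and the file opened for inspection as follows. The helper names are
mine, and the code assumes the file is bzip2-compressed XML, as the .bz2
suffix suggests.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches the "PEngine Input stored in: <path>" part of a pengine log line.
PE_INPUT = re.compile(r"PEngine Input stored in:\s*(\S+)")


def pe_input_path(log_line):
    """Return the PE input file named in a pengine log line, or None."""
    m = PE_INPUT.search(log_line)
    return m.group(1) if m else None


def load_pe_input(path):
    """Decompress a .bz2 PE input file and return the parsed XML root."""
    with bz2.open(path, "rb") as fh:
        return ET.parse(fh).getroot()


if __name__ == "__main__":
    line = ("Oct 16 11:40:53 wc01 pengine: [4617]: WARN: process_pe_message: "
            "Transition 359: WARNINGs found during PE processing. PEngine Input "
            "stored in: /var/lib/heartbeat/pengine/pe-warn-388.bz2")
    print(pe_input_path(line))
```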

> is the stonith agent broken?

Hard to say without logs, but probably not.

> is pacemaker broken? is any
> other part broken? is this behavior intended?

Yes, it is intended, insofar as the cluster won't budge until a node
that needs to be fenced has been fenced. Whether it is sensible to
fence that node is another matter.

If you find this behaviour unexpected, then please file a bug in
bugzilla and attach a full report (hb_report).

Thanks,

Dejan

> cheers,
> raoul
> -- 
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
> Technischer Leiter
> 
> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.            office at ipax.at
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> ____________________________________________________________________