[Pacemaker] Help with OCFS2 / DLM Stability

Dejan Muhamedagic dejanmm at fastmail.fm
Wed Mar 10 12:28:27 UTC 2010


Hi,

On Tue, Mar 09, 2010 at 11:37:02AM -0000, Darren.Mansell at opengi.co.uk wrote:
> Hi everyone.
> 
> Further to some discussions a couple of weeks ago with regard to OCFS2
> on SLES 11 HAE I'm looking to finally nail this problem.
> 
> We have a three-node cluster that has a STONITH shootout every week.
> This morning one node got stuck in a state where it couldn't be fenced
> due to the RSA not being responsive.
> 
> I'm not sure if the problem is due to:
> 
> * Network interruption causing Totem failures.
> * Java (Tomcat) processes falling over.

I suppose that those are activequote and activequoteadmin. You
should increase the timeouts. 10 seconds is too short in general,
and for java/tomcat probably even more so.
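
As a rough sketch of what longer timeouts could look like in the crm
shell configuration: the resource name is the one mentioned above, but
the lsb class and every interval/timeout value here are assumptions
for illustration, not taken from the hb_report. Pick values from how
long Tomcat really takes to start and stop under load.

```
# Hypothetical fragment, as it might appear in "crm configure show".
# The resource class/type and all values are assumptions.
primitive activequote lsb:activequote \
        op monitor interval="30s" timeout="60s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
```

The same kind of change would apply to activequoteadmin.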

> * DLM falling over.
> * Any of the above in any combination.
> 
> I've attached a hb_report. Could you see if you can see anything?

Is there any good reason to ignore quorum? For a three-node cluster
you should remove the no-quorum-policy property or, perhaps because
of ocfs2, set it to freeze.
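
A minimal sketch of the second option, assuming the crm shell is used
to manage the configuration:

```
# Set the cluster-wide policy for partitions that lose quorum.
crm configure property no-quorum-policy="freeze"

# Check that the property is now set as expected.
crm configure show | grep no-quorum-policy
```

With freeze, a partition without quorum keeps running the resources it
already has but does not start new ones.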

Pacemaker is at 1.0.3, so perhaps it's time to upgrade too. There is
a SLE11 HAE update available.


