[Pacemaker] Help with OCFS2 / DLM Stability
Dejan Muhamedagic
dejanmm at fastmail.fm
Wed Mar 10 12:28:27 UTC 2010
Hi,
On Tue, Mar 09, 2010 at 11:37:02AM -0000, Darren.Mansell at opengi.co.uk wrote:
> Hi everyone.
>
> Further to some discussions a couple of weeks ago with regard to OCFS2
> on SLES 11 HAE, I'm looking to finally nail this problem.
>
> We have a 3 node cluster that has a STONITH shootout every week. This
> morning one node got stuck in a state where it couldn't be fenced due
> to the RSA not being responsive.
>
> I'm not sure if the problem is due to:
>
> * Network interruption causing Totem failures.
> * Java (Tomcat) processes falling over.
I suppose that those are activequote and activequoteadmin. You
should increase the timeouts: 10 seconds is too short in general,
and probably even more so for Java/Tomcat.
> * DLM falling over.
> * Any of the above in any combination.
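For the timeout increase, a sketch of what the operations could look
like via `crm configure edit activequote`. The class and type
(lsb:activequote) are an assumed, hypothetical stand-in, so keep
whatever the existing definition uses; the values are illustrative
starting points, not tuned numbers:

```
# Hypothetical: lsb:activequote is an assumed class/type. Size the
# timeouts to how long the JVM really needs to start, stop and answer
# a status check under load.
primitive activequote lsb:activequote \
        op monitor interval="30s" timeout="60s" \
        op start interval="0" timeout="120s" \
        op stop interval="0" timeout="120s"
```

The same applies to activequoteadmin.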
>
> I've attached a hb_report. Could you see if you can see anything?
Any good reason to ignore quorum? For a three-node cluster you
should remove the no-quorum-policy property or, perhaps because
of ocfs2, set it to freeze.
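With the crm shell that could be done, for example, like this
(a sketch; check it against the crm version shipped with your HAE):

```
# Set the policy explicitly (the freeze option suggested for ocfs2):
crm configure property no-quorum-policy=freeze
# Alternatively, delete the no-quorum-policy line in
# "crm configure edit" to fall back to the default, which is stop.
```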
Pacemaker is at 1.0.3; perhaps it's time to upgrade too. There is an
SLE11 HAE update available.