[Pacemaker] startup problem DLM on ubuntu lucid

Sun Apr 25 07:15:53 EDT 2010

On Saturday, 24 April 2010, at 17:27:42, Pål Simensen wrote:
> Can you check your dmesg to see if DLM is segfaulting? I might be
> experiencing the same problem. If corosync is started at boot, DLM
> segfaults, but if it's started manually, everything is OK. I'm still trying
> to find out more about what is going on, and sadly I can't provide more
> information before Monday, when I get to work. We even tried bootchart to
> see if that could provide some more information, but sadly it didn't. We
> also changed the start order of corosync by renaming the init symlink to
> S98corosync, but that didn't work out either.

You are right: DLM is segfaulting, and the network is already up at that time.

[   15.654093] br53: port 1(vlan53) entering forwarding state
[   15.664083] br83: port 1(vlan83) entering forwarding state
...
[   46.979087] dlm_controld.pc[2533]: segfault at 0 ip 00007f30f7d68022 sp 00007fffddf0e288 error 4 in libc-2.11.1.so[7f30f7ce5000+178000]
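
The kernel line above can be decoded mechanically: the offset of the faulting instruction inside libc is `ip` minus the mapping base shown in brackets, and `error` is the x86 page-fault error code (bit 0 set = protection violation rather than a missing page, bit 1 = write access, bit 2 = user mode). The following is a minimal Python sketch; the helper `decode_segfault` is hypothetical, and its regex is tailored to the message format shown here rather than being a stable kernel interface.

```python
import re

# Matches a kernel "segfault at" line in the format shown in the dmesg excerpt.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction inside the mapped object.
        "offset": int(m["ip"], 16) - int(m["base"], 16),
        "protection_fault": bool(err & 1),  # False = page was not present
        "write": bool(err & 2),             # False = read access
        "user_mode": bool(err & 4),
    }


line = ("[   46.979087] dlm_controld.pc[2533]: segfault at 0 ip 00007f30f7d68022 "
        "sp 00007fffddf0e288 error 4 in libc-2.11.1.so[7f30f7ce5000+178000]")
info = decode_segfault(line)
print(info["object"], hex(info["offset"]))  # libc-2.11.1.so 0x83022
```

Here `error 4` decodes to a user-mode read of an unmapped page at address 0, i.e. a NULL pointer dereference inside libc. That is consistent with dlm_controld handing a NULL pointer to a libc routine, but this is an inference, not something established in this thread. The name `dlm_controld.pc` is presumably `dlm_controld.pcmk`, cut off by the kernel's 15-character command-name limit. With libc debug symbols installed, a tool such as `addr2line` or `gdb` may be able to map offset 0x83022 to a function, and a core dump of dlm_controld would show the caller.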

I rebuilt the packages from http://ppa.launchpad.net/ubuntu-ha/lucid-cluster/ubuntu/pool/main/r/redhat-cluster
on a freshly installed Lucid VM, but this didn't change anything. I even
upgraded them to the current 3.0.11, and they are still segfaulting. So trial
and error doesn't seem to work. Maybe someone with a little more understanding
of what's going on can make an educated guess?

TIA,
Oliver

> 
> On Sat, Apr 24, 2010 at 12:25 PM, Oliver Heinz <oheinz at fbihome.de> wrote:
> > Hi,
> > 
> > When I reboot my cluster nodes, they won't bring up the OCFS2 filesystem
> > because resDLM fails. When I issue '/etc/init.d/pacemaker restart'
> > afterwards, everything is fine.
> > 
> > The machine needs quite a while to bring up the (bonding) network
> > interfaces. Do timeout values need to be adjusted? Or should I rather try
> > to start pacemaker after the network is completely up?
> > 
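One way to test the second option is to gate the cluster start on the bonded interface's operational state. The following is a minimal sketch, assuming a Python 3.3+ interpreter on the node and the standard Linux sysfs file `/sys/class/net/<iface>/operstate`; the function name `wait_for_operstate` and the interface name `bond0` are hypothetical.

```python
import os
import time


def wait_for_operstate(iface, want="up", timeout=120.0, interval=1.0,
                       sysfs_root="/sys/class/net"):
    """Poll <sysfs_root>/<iface>/operstate until it reads `want`.

    Returns True once the state is reached, or False after `timeout` seconds.
    """
    path = os.path.join(sysfs_root, iface, "operstate")
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(path) as f:
                if f.read().strip() == want:
                    return True
        except FileNotFoundError:
            pass  # the interface (e.g. a bond) has not been created yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A boot-time wrapper could call `wait_for_operstate("bond0")` and only then run `/etc/init.d/pacemaker start`. Note that operstate "up" only means the link is operational, not that addresses are configured. Also, the dmesg excerpt in the reply above shows the bridges forwarding roughly 30 seconds before dlm_controld crashed, so network readiness may not be the cause here.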
> > 
> > my current config:
> > 
> > node server-c \
> >        attributes standby="off"
> > node server-d
> > primitive failover-ip ocf:heartbeat:IPaddr \
> >        params ip="192.168.5.150" \
> >        op monitor interval="10s"
> > primitive resDLM ocf:pacemaker:controld \
> >        op monitor interval="120s"
> > primitive resFS ocf:heartbeat:Filesystem \
> >        params device="/dev/mapper/data-data" directory="/srv/data" fstype="ocfs2" \
> >        op monitor interval="120s"
> > primitive resO2CB ocf:pacemaker:o2cb \
> >        op monitor interval="120s"
> > clone cloneDLM resDLM \
> >        meta globally-unique="false" interleave="true"
> > clone cloneFS resFS \
> >        meta interleave="true" ordered="true"
> > clone cloneO2CB resO2CB \
> >        meta globally-unique="false" interleave="true"
> > colocation colFSO2CB inf: cloneFS cloneO2CB
> > colocation colO2CBDLM inf: cloneO2CB cloneDLM
> > order ordDLMO2CB 0: cloneDLM cloneO2CB
> > order ordO2CBFS 0: cloneO2CB cloneFS
> > property $id="cib-bootstrap-options" \
> >        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
> >        cluster-infrastructure="openais" \
> >        expected-quorum-votes="2" \
> >        stonith-enabled="false" \
> >        last-lrm-refresh="1272026744"
> > 
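Independent of the segfault, both order constraints in this configuration use a score of 0. In Pacemaker 1.0, a zero-score ordering is advisory: it only takes effect when both resources are starting or stopping in the same transition, whereas a score of `inf` makes it mandatory. Whether that plays a part at boot is not established in this thread. The following is a small Python sketch that lists zero-score orderings for review; the helper `advisory_orders` is hypothetical, and its regex assumes the one-line `order <id> <score>: ...` form printed by `crm configure show`.

```python
import re

# An "order" line in crm shell syntax: order <id> <score>: <first> <then> ...
ORDER_RE = re.compile(r"^\s*order\s+(?P<id>\S+)\s+(?P<score>[^:\s]+):")


def advisory_orders(crm_config):
    """Return the ids of order constraints whose score is 0 (advisory)."""
    found = []
    for line in crm_config.splitlines():
        m = ORDER_RE.match(line)
        if m and m["score"] == "0":
            found.append(m["id"])
    return found


config = """\
colocation colFSO2CB inf: cloneFS cloneO2CB
colocation colO2CBDLM inf: cloneO2CB cloneDLM
order ordDLMO2CB 0: cloneDLM cloneO2CB
order ordO2CBFS 0: cloneO2CB cloneFS
"""
print(advisory_orders(config))  # ['ordDLMO2CB', 'ordO2CBFS']
```

If mandatory ordering is wanted, the crm shell form would be `order ordDLMO2CB inf: cloneDLM cloneO2CB`, and likewise for ordO2CBFS.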
> > I tried something like
> > primitive resDLM ocf:pacemaker:controld \
> >        op start timeout="100s" \
> >        op monitor interval="120s"
> > 
> > but this didn't help.
> > 
> > TIA,
> > Oliver
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf