[Pacemaker] startup problem DLM on ubuntu lucid

Mon Apr 26 05:17:26 EDT 2010

On Monday, 26 April 2010 08:35:53, Andrew Beekhof wrote:
> What versions of pacemaker and the dlm?
> What does the stack trace from the core look like?

I reinstalled the packages from
http://ppa.launchpad.net/ubuntu-ha/lucid-cluster

That gives me 3.0.7 for the dlm
and 1.0.8+hg15494 for pacemaker.

report: http://users.fbihome.de/~oheinz/ha-cluster/report_1.tar.bz2
core-file: http://users.fbihome.de/~oheinz/ha-cluster/core.2606.bz2

I cc'ed the ubuntu-ha list, as this might be packaging-related.
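
A side note on reading the segfault line quoted further down: the following
is a minimal Python sketch, not part of the original report, that splits the
dmesg line into its fields and computes the offset of the faulting
instruction inside libc. The regular expression only reflects the layout of
this one line, and the function name decode_segfault and the dictionary keys
are hypothetical. The three flags are the low bits of the x86 page-fault
error code (bit 0: page present, bit 1: write access, bit 2: user mode).

```python
import re

# dmesg line from the quoted message, joined into a single line
LINE = ("dlm_controld.pc[2533]: segfault at 0 ip 00007f30f7d68022 "
        "sp 00007fffddf0e288 error 4 in libc-2.11.1.so[7f30f7ce5000+178000]")

# Layout assumed from the single line above; other kernels may format differently.
PATTERN = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a kernel segfault message into fields (hypothetical helper)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a segfault line in the expected layout")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"])
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "object": m["obj"],
        # offset of the faulting instruction relative to the mapping base
        "offset": ip - base,
        # x86 page-fault error code bits
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


info = decode_segfault(LINE)
print(hex(info["offset"]))  # 0x83022
```

For this line the fault address is 0 and error 4 decodes to a user-mode read
of a page that is not present, which is consistent with a NULL pointer being
dereferenced inside libc. The offset 0x83022 could then be looked up in a
matching libc-2.11.1 build with debug symbols, if one is available. The
process name is probably cut off, since the kernel keeps at most 15
characters of it.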

TIA,
Oliver

> 
> On Sun, Apr 25, 2010 at 1:15 PM, Oliver Heinz <oheinz at fbihome.de> wrote:
> > On Saturday, 24 April 2010, at 17:27:42, Pål Simensen wrote:
> >> Can you check your dmesg to see if DLM is segfaulting? I might be
> >> experiencing the same problem. If corosync is started at boot DLM
> >> segfaults, but if it's started manually everything is ok. Still trying
> >> to find out more about what is going on, and I sadly can't provide more
> >> information before Monday when I get to work. We even tried bootchart
> >> to see if that could provide some more information, but it did not. We
> >> also changed the start order of corosync by renaming the init symlink
> >> to S98corosync, but that didn't help either.
> > 
> > You are right, dlm is segfaulting, and the network is already up at that time.
> > 
> > [   15.654093] br53: port 1(vlan53) entering forwarding state
> > [   15.664083] br83: port 1(vlan83) entering forwarding state
> > ...
> > [   46.979087] dlm_controld.pc[2533]: segfault at 0 ip 00007f30f7d68022 sp 00007fffddf0e288 error 4 in libc-2.11.1.so[7f30f7ce5000+178000]
> > 
> > I rebuilt the packages from
> > http://ppa.launchpad.net/ubuntu-ha/lucid-cluster/ubuntu/pool/main/r/redhat-cluster
> > on a freshly installed lucid VM, but this didn't change anything. I even
> > upgraded them to the current 3.0.11; it still segfaults. So trial and
> > error does not seem to work. Maybe someone with a little more
> > understanding of what's going on can make an educated guess?
> > 
> > TIA,
> > Oliver
> > 
> >> On Sat, Apr 24, 2010 at 12:25 PM, Oliver Heinz <oheinz at fbihome.de> wrote:
> >> > Hi,
> >> > 
> >> > When I reboot my cluster nodes, they won't bring up the ocfs2
> >> > filesystem because resDLM fails. When I issue a
> >> > '/etc/init.d/pacemaker restart' afterwards, everything is fine.
> >> > 
> >> > The machine needs quite a while to bring up the (bonding) network
> >> > interfaces.
> >> > Do timeout values need to be adjusted? Or should I rather try to
> >> > start pacemaker only after the network is completely up?
> >> > 
> >> > 
> >> > my current config:
> >> > 
> >> > node server-c \
> >> >        attributes standby="off"
> >> > node server-d
> >> > primitive failover-ip ocf:heartbeat:IPaddr \
> >> >        params ip="192.168.5.150" \
> >> >        op monitor interval="10s"
> >> > primitive resDLM ocf:pacemaker:controld \
> >> >        op monitor interval="120s"
> >> > primitive resFS ocf:heartbeat:Filesystem \
> >> >        params device="/dev/mapper/data-data" directory="/srv/data" fstype="ocfs2" \
> >> >        op monitor interval="120s"
> >> > primitive resO2CB ocf:pacemaker:o2cb \
> >> >        op monitor interval="120s"
> >> > clone cloneDLM resDLM \
> >> >        meta globally-unique="false" interleave="true"
> >> > clone cloneFS resFS \
> >> >        meta interleave="true" ordered="true"
> >> > clone cloneO2CB resO2CB \
> >> >        meta globally-unique="false" interleave="true"
> >> > colocation colFSO2CB inf: cloneFS cloneO2CB
> >> > colocation colO2CBDLM inf: cloneO2CB cloneDLM
> >> > order ordDLMO2CB 0: cloneDLM cloneO2CB
> >> > order ordO2CBFS 0: cloneO2CB cloneFS
> >> > property $id="cib-bootstrap-options" \
> >> >        dc-version="1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd" \
> >> >        cluster-infrastructure="openais" \
> >> >        expected-quorum-votes="2" \
> >> >        stonith-enabled="false" \
> >> >        last-lrm-refresh="1272026744"
> >> > 
> >> > I tried something like
> >> > 
> >> > primitive resDLM ocf:pacemaker:controld \
> >> >        op start timeout="100s" \
> >> >        op monitor interval="120s"
> >> > 
> >> > but this didn't help.
> >> > 
> >> > TIA,
> >> > Oliver
> >> > 
> >> > _______________________________________________
> >> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >> > 
> >> > Project Home: http://www.clusterlabs.org
> >> > Getting started:
> >> > http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf