[Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb

Jake Smith jsmith at argotec.com
Mon Aug 19 14:10:07 UTC 2013


> -----Original Message-----
> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
> Sent: Friday, August 16, 2013 10:31 PM
> To: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
> 
> 
> Am 16.08.2013 15:46, schrieb Jake Smith:
> >> -----Original Message-----
> >> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
> >> Sent: Friday, August 16, 2013 9:05 AM
> >> To: The Pacemaker cluster resource manager
> >> Subject: [Pacemaker] Dual primary drbd + ocfs2: problems starting
> >> o2cb
> >>
> >> Hi all,
> >>
> >> i'm working on a two node pacemaker cluster with dual primary drbd
> >> and ocfs2.
> >>
> >> Dual pri drbd and ocfs2 WITHOUT pacemaker work fine (mounting,
> >> reading, writing, everything...).
> >>
> >> When i try to make this work in pacemaker, there seems to be a
> >> problem starting the o2cb resource.
> >>
> >> My (already simplified) configuration is:
> >> -----------------------------------------
> >> node poc1 \
> >> 	attributes standby="off"
> >> node poc2 \
> >> 	attributes standby="off"
> >> primitive res_dlm ocf:pacemaker:controld \
> >> 	op monitor interval="120"
> >> primitive res_drbd ocf:linbit:drbd \
> >> 	params drbd_resource="r0" \
> >> 	op stop interval="0" timeout="100" \
> >> 	op start interval="0" timeout="240" \
> >> 	op promote interval="0" timeout="90" \
> >> 	op demote interval="0" timeout="90" \
> >> 	op notify interval="0" timeout="90" \
> >> 	op monitor interval="40" role="Slave" timeout="20" \
> >> 	op monitor interval="20" role="Master" timeout="20"
> >> primitive res_o2cb ocf:pacemaker:o2cb \
> >> 	op monitor interval="60"
> >> ms ms_drbd res_drbd \
> >> 	meta notify="true" master-max="2" master-node-max="1" \
> >> 	target-role="Started"
> >> property $id="cib-bootstrap-options" \
> >> 	no-quorum-policy="ignore" \
> >> 	stonith-enabled="false" \
> >> 	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> >> 	cluster-infrastructure="openais" \
> >> 	expected-quorum-votes="2" \
> >> 	last-lrm-refresh="1376574860"
> >>
> >
> > Looks like you are missing ordering, colocation, and clone statements
> > (or a group, to keep the config shorter - a group gives you order and
> > colocation in one statement).  The resources *must* start in a
> > particular order, they must run on the same node, and there must be an
> > instance of each resource on each node.
> >
> > More here for DRBD 8.4:
> > http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html
> > Or DRBD 8.3:
> > http://www.drbd.org/users-guide-8.3/s-ocfs2-pacemaker.html
> >
> > Basically add:
> > group grp_dlm_o2cb res_dlm res_o2cb
> > clone cl_dlm_o2cb grp_dlm_o2cb meta interleave="true"
> > order ord_drbd_then_dlm_o2cb inf: ms_drbd:promote cl_dlm_o2cb:start
> > colocation col_dlm_o2cb_with_drbdmaster inf: cl_dlm_o2cb ms_drbd:Master
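> >
> > If you'd rather not use a group, the same thing can be written out with
> > separate clones and explicit constraints (just a sketch - the resource
> > names below are made up, adjust to your config; the group form above is
> > simply the shorter way to say it):
> >
> > # example names only - adjust to your own configuration
> > clone cl_dlm res_dlm meta interleave="true"
> > clone cl_o2cb res_o2cb meta interleave="true"
> > order ord_drbd_then_dlm inf: ms_drbd:promote cl_dlm:start
> > colocation col_dlm_with_drbdmaster inf: cl_dlm ms_drbd:Master
> > order ord_dlm_then_o2cb inf: cl_dlm:start cl_o2cb:start
> > colocation col_o2cb_with_dlm inf: cl_o2cb cl_dlm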
> >
> > HTH
> >
> > Jake
> >
> 
> Hello Jake,
> 
> thanks for your reply. I already had res_dlm and res_o2cb grouped
> together and cloned as in your advice; in fact that was my initial
> configuration. But the problem showed up anyway, so i tried to simplify
> the configuration to reduce possible error sources.
> 
> But now it seems i have found a solution, or at least a workaround: i
> just use the LSB resource agent lsb:o2cb instead. This one works! The
> resource starts without a problem on both nodes, and as far as i can see
> right now everything is fine (tried with and without an additional group
> and clone resource).
> 
> I don't know if this will bring some drawbacks in the future, but for
> the moment my problem seems to be solved.

Not sure either - usually OCF resource agents are more robust than plain
LSB init scripts.  I would also verify that the o2cb init script is fully
LSB compliant, or your cluster will have issues.
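
A quick way to sanity-check that by hand (assuming the init script lives
at /etc/init.d/o2cb - adjust the path for your distro) is to run it
directly and make sure the exit codes follow the LSB spec:

/etc/init.d/o2cb start  ; echo $?   # 0 (path assumed, see above)
/etc/init.d/o2cb status ; echo $?   # 0 while the service is running
/etc/init.d/o2cb start  ; echo $?   # starting again must still return 0
/etc/init.d/o2cb stop   ; echo $?   # 0
/etc/init.d/o2cb status ; echo $?   # 3 once it is stopped
/etc/init.d/o2cb stop   ; echo $?   # stopping again must still return 0

If any of those return something else, Pacemaker can get confused about
the real state of the resource.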

> 
> Currently it seems to me that there's a subtle problem with the
> ocf:pacemaker:o2cb resource agent; at least on my system.

Maybe, maybe not - if you take a look at the o2cb resource agent, the
error message you were getting comes after it tries to start
/usr/sbin/ocfs2_controld.pcmk and waits 10 seconds for it without
success...  I would time how long o2cb takes to start by hand; it might
be as simple as allowing more time for the daemon to come up.
I haven't set up ocfs2 in a while, but I believe you may be able to
extend that timeout on the primitive itself without having to muck with
the actual resource agent.
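
Something along these lines, for example (untested - check
"crm ra meta ocf:pacemaker:o2cb" first; if your version of the agent
exposes a daemon_timeout parameter, that is likely the 10-second wait in
question):

# daemon_timeout is an assumption - drop that line if "crm ra meta"
# doesn't list it for your version of the agent
primitive res_o2cb ocf:pacemaker:o2cb \
	params daemon_timeout="30" \
	op start interval="0" timeout="90" \
	op monitor interval="60"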

Jake

> 
> Anyway, thanks a lot for your answer!
> Best regards
> elmar
> 
> 
> >
> >> First error message in corosync.log as far as i can identify it:
> >> ----------------------------------------------------------------
> >> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr)
> >> dlm_controld.pcmk: no process found
> >> [ other stuff ]
> >> lrmd: [5547]: info: RA output: (res_dlm:start:stderr)
> >> dlm_controld.pcmk: no process found
> >> [ other stuff ]
> >>    lrmd: [5547]: info: RA output: (res_o2cb:start:stderr)
> >> 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
> >>
> >> (You can find the whole corosync logfile - from starting corosync on
> >> node 1 until after the resources have started - at:
> >> http://www.marschke.info/corosync_drei.log )
> >>
> >> syslog shows:
> >> -------------
> >> ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not
> >> exist
> >>
> >>
> >> Output of crm_mon:
> >> ------------------
> >> ============
> >> Stack: openais
> >> Current DC: poc1 - partition WITHOUT quorum
> >> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> >> 2 Nodes configured, 2 expected votes
> >> 4 Resources configured.
> >> ============
> >>
> >> Online: [ poc1 ]
> >> OFFLINE: [ poc2 ]
> >>
> >>    Master/Slave Set: ms_drbd [res_drbd]
> >>        Masters: [ poc1 ]
> >>        Stopped: [ res_drbd:1 ]
> >>    res_dlm	(ocf::pacemaker:controld):	Started poc1
> >>
> >> Migration summary:
> >> * Node poc1:
> >>      res_o2cb: migration-threshold=1000000 fail-count=1000000
> >>
> >> Failed actions:
> >>       res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete):
> >> unknown error
> >>
> >> ---------------------------------------------------------------------
> >> This is the situation after a reboot of node poc1. For simplification
> >> i left pacemaker / corosync unstarted on the second node, and i have
> >> already removed the group and clone resources that dlm and o2cb had
> >> previously been in (the errors showed up there as well).
> >>
> >> Is my configuration of the resource agents correct?
> >> I checked using "ra meta ...", and as far as i can tell everything
> >> is ok.
> >>
> >> Is some piece of software missing?
> >> dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are
> >> available, and i even created additional links in /usr/sbin:
> >> root@poc1:~# which ocfs2_controld.pcmk
> >> /usr/sbin/ocfs2_controld.pcmk
> >> root@poc1:~# which dlm_controld.pcmk
> >> /usr/sbin/dlm_controld.pcmk
> >> root@poc1:~#
> >>
> >> I already googled but couldn't find anything useful. Thanks for any
> >> hints... :)
> >>
> >> kind regards
> >> elmar
> >>
> >>
> >>
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org




