[Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb

Elmar Marschke elmar.marschke at schenker.at
Wed Aug 21 10:07:58 EDT 2013


On 19.08.2013 16:10, Jake Smith wrote:
>> -----Original Message-----
>> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
>> Sent: Friday, August 16, 2013 10:31 PM
>> To: pacemaker at oss.clusterlabs.org
>> Subject: Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
>>
>>
>> On 16.08.2013 15:46, Jake Smith wrote:
>>>> -----Original Message-----
>>>> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
>>>> Sent: Friday, August 16, 2013 9:05 AM
>>>> To: The Pacemaker cluster resource manager
>>>> Subject: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
>>>>
>>>> Hi all,
>>>>
>>>> I'm working on a two-node pacemaker cluster with dual-primary drbd
>>>> and ocfs2.
>>>>
>>>> Dual-primary drbd and ocfs2 WITHOUT pacemaker work fine (mounting,
>>>> reading, writing, everything...).
>>>>
>>>> When I try to make this work in pacemaker, there seems to be a
>>>> problem starting the o2cb resource.
>>>>
>>>> My (already simplified) configuration is:
>>>> -----------------------------------------
>>>> node poc1 \
>>>> 	attributes standby="off"
>>>> node poc2 \
>>>> 	attributes standby="off"
>>>> primitive res_dlm ocf:pacemaker:controld \
>>>> 	op monitor interval="120"
>>>> primitive res_drbd ocf:linbit:drbd \
>>>> 	params drbd_resource="r0" \
>>>> 	op stop interval="0" timeout="100" \
>>>> 	op start interval="0" timeout="240" \
>>>> 	op promote interval="0" timeout="90" \
>>>> 	op demote interval="0" timeout="90" \
>>>> 	op notify interval="0" timeout="90" \
>>>> 	op monitor interval="40" role="Slave" timeout="20" \
>>>> 	op monitor interval="20" role="Master" timeout="20"
>>>> primitive res_o2cb ocf:pacemaker:o2cb \
>>>> 	op monitor interval="60"
>>>> ms ms_drbd res_drbd \
>>>> 	meta notify="true" master-max="2" master-node-max="1" target-role="Started"
>>>> property $id="cib-bootstrap-options" \
>>>> 	no-quorum-policy="ignore" \
>>>> 	stonith-enabled="false" \
>>>> 	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>>> 	cluster-infrastructure="openais" \
>>>> 	expected-quorum-votes="2" \
>>>> 	last-lrm-refresh="1376574860"
>>>>
>>>
>>> Looks like you are missing ordering, colocation, and clone statements
>>> (or even a group, to make the config shorter; a group is order and
>>> colocation in one statement).  The resources *must* start in a
>>> particular order, they must run on the same node, and there must be an
>>> instance of each resource on each node.
>>>
>>> More here for DRBD 8.4:
>>> http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html
>>> Or DRBD 8.3:
>>> http://www.drbd.org/users-guide-8.3/s-ocfs2-pacemaker.html
>>>
>>> Basically add:
>>>
>>> group grp_dlm_o2cb res_dlm res_o2cb
>>> clone cl_dlm_o2cb grp_dlm_o2cb meta interleave="true"
>>> order ord_drbd_then_dlm_o2cb inf: ms_drbd:promote cl_dlm_o2cb:start
>>> colocation col_dlm_o2cb_with_drbdmaster inf: cl_dlm_o2cb ms_drbd:Master
>>>
>>> (note the order and colocation constraints reference the master/slave
>>> resource ms_drbd, not the res_drbd primitive)
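>>>
>>> (interleave="true" lets each clone instance on a node start as soon as
>>> its dependencies on that same node are satisfied - here, as soon as the
>>> local DRBD instance is promoted - instead of waiting for all nodes.)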
>>>
>>> HTH
>>>
>>> Jake
>>>
>>
>> Hello Jake,
>>
>> thanks for your reply. I already had res_dlm and res_o2cb grouped
>> together and cloned as in your advice; indeed this was my initial
>> configuration. But the problem showed up, so I tried to simplify the
>> configuration to reduce possible error sources.
>>
>> But now it seems I found a solution, or at least a workaround: I just
>> use the LSB resource agent lsb:o2cb. This one works! The resource
>> starts without a problem on both nodes, and as far as I can see right
>> now everything is fine (tried with and without additional group and
>> clone resources).
>>
>> I don't know if this will bring drawbacks in the future, but for the
>> moment my problem seems to be solved.
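>>
>> In case it's useful, the working primitive is essentially just this
>> (the monitor interval is only an example value):
>>
>> primitive res_o2cb lsb:o2cb \
>> 	op monitor interval="60"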
>
> Not sure either - usually OCF resource agents are more robust than simple
> LSB init scripts. I would also verify that the o2cb init script is fully
> LSB compliant, or your cluster will have issues.
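>
> A rough check, along the lines of the LSB-compatibility checklist in the
> Pacemaker documentation, would be something like this (expected exit
> codes in the comments; init script path assumed):
>
> /etc/init.d/o2cb start  ; echo $?   # expect 0
> /etc/init.d/o2cb status ; echo $?   # expect 0 (running)
> /etc/init.d/o2cb start  ; echo $?   # expect 0 (start while running)
> /etc/init.d/o2cb stop   ; echo $?   # expect 0
> /etc/init.d/o2cb status ; echo $?   # expect 3 (stopped)
> /etc/init.d/o2cb stop   ; echo $?   # expect 0 (stop while stopped)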
>
>>
>> Currently it seems to me that there's a subtle problem with the
>> ocf:pacemaker:o2cb resource agent; at least on my system.
>
> Maybe, maybe not - if you take a look at the o2cb resource agent, the
> error message you were getting comes after it tries to start
> /usr/sbin/ocfs2_controld.pcmk for 10 seconds without success... I would
> time starting o2cb.  It might be as simple as allowing more time for
> startup of the daemon.
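>
> For example, something like this would give a rough number (assuming the
> o2cb init script is present; it's only a proxy for what the resource
> agent does, since the agent starts ocfs2_controld.pcmk itself):
>
> time /etc/init.d/o2cb start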
> I've not set up ocfs2 in a while, but I believe you may be able to
> extend that timeout in the definition of the primitive without having to
> muck with the actual resource agent.
>
> Jake

Hello Jake,

yes, I think that's possible, as you wrote. This is what "crm ra meta
ocf:pacemaker:o2cb" says:

daemon_timeout (string, [10]): Daemon Timeout
     Number of seconds to allow the control daemon to come up
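
Presumably raising it would look something like this (a sketch; 30
seconds is just an example value):

primitive res_o2cb ocf:pacemaker:o2cb \
	params daemon_timeout="30" \
	op monitor interval="60"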

Thanks for the hint. I'll check that out when possible and see if it 
changes the behaviour. Currently I'm fine with lsb:o2cb...

regards

>
>>
>> Anyway, thanks a lot for your answer!
>> Best regards
>> elmar
>>
>>
>>>
>>>> First error message in corosync.log, as far as I can identify it:
>>>> ----------------------------------------------------------------
>>>> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk: no process found
>>>> [ other stuff ]
>>>> lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk: no process found
>>>> [ other stuff ]
>>>> lrmd: [5547]: info: RA output: (res_o2cb:start:stderr) 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
>>>>
>>>> (You can find the whole corosync logfile, from starting corosync on
>>>> node 1 until after the resources have started, at:
>>>> http://www.marschke.info/corosync_drei.log )
>>>>
>>>> syslog shows:
>>>> -------------
>>>> ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not
>>>> exist
>>>>
>>>>
>>>> Output of crm_mon:
>>>> ------------------
>>>> ============
>>>> Stack: openais
>>>> Current DC: poc1 - partition WITHOUT quorum
>>>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>>>> 2 Nodes configured, 2 expected votes
>>>> 4 Resources configured.
>>>> ============
>>>>
>>>> Online: [ poc1 ]
>>>> OFFLINE: [ poc2 ]
>>>>
>>>>     Master/Slave Set: ms_drbd [res_drbd]
>>>>         Masters: [ poc1 ]
>>>>         Stopped: [ res_drbd:1 ]
>>>>     res_dlm	(ocf::pacemaker:controld):	Started poc1
>>>>
>>>> Migration summary:
>>>> * Node poc1:
>>>>       res_o2cb: migration-threshold=1000000 fail-count=1000000
>>>>
>>>> Failed actions:
>>>>        res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete):
>>>> unknown error
>>>>
>>>> ---------------------------------------------------------------------
>>>> This is the situation after a reboot of node poc1. For simplification
>>>> I left pacemaker / corosync unstarted on the second node, and I had
>>>> already removed the group and clone resources that dlm and o2cb had
>>>> been in (the errors occurred there as well).
>>>>
>>>> Is my configuration of the resource agents correct?
>>>> I checked using "crm ra meta ...", and as far as I can tell everything
>>>> is OK.
>>>>
>>>> Is some piece of software missing?
>>>> dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are
>>>> available, and I even created additional links in /usr/sbin:
>>>> root at poc1:~# which ocfs2_controld.pcmk
>>>> /usr/sbin/ocfs2_controld.pcmk
>>>> root at poc1:~# which dlm_controld.pcmk
>>>> /usr/sbin/dlm_controld.pcmk
>>>> root at poc1:~#
>>>>
>>>> I already googled but couldn't find anything useful. Thanks for any
>>>> hints... :)
>>>>
>>>> kind regards
>>>> elmar
>>>>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>



