[Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
Wen
deutschland.gray at gmail.com
Thu Aug 22 01:30:15 UTC 2013
Hi
I am also working on a dual-primary drbd setup.
Now I have run into an issue that looks easy but that I cannot solve.
I configured a ClusterIP resource and a clone of it:
ClusterIP:0  started on node1
ClusterIP:1  started on node2
Now I put node2 into standby.
All the connections go to node1.
Then I bring node2 online again.
But ClusterIP keeps running on node1 only and never becomes active/active again.
Do you have a similar experience, or
does your ClusterIP always come back to normal and run on both nodes after a standby test?
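
(For reference, by "ClusterIP clone" I mean roughly the usual setup from the
Clusters from Scratch guide; the address and the names below are only
placeholders, not my exact configuration:

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="192.168.0.120" cidr_netmask="32" clusterip_hash="sourceip" \
        op monitor interval="30s"
clone cl_ClusterIP ClusterIP \
        meta globally-unique="true" clone-max="2" clone-node-max="2"
)
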
What you want then follow your heart.
On 21 Aug, 2013, at 10:07 PM, Elmar Marschke <elmar.marschke at schenker.at> wrote:
>
> On 19.08.2013 16:10, Jake Smith wrote:
>>> -----Original Message-----
>>> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
>>> Sent: Friday, August 16, 2013 10:31 PM
>>> To: pacemaker at oss.clusterlabs.org
>>> Subject: Re: [Pacemaker] Dual primary drbd + ocfs2: problems starting o2cb
>>>
>>>
>>> On 16.08.2013 15:46, Jake Smith wrote:
>>>>> -----Original Message-----
>>>>> From: Elmar Marschke [mailto:elmar.marschke at schenker.at]
>>>>> Sent: Friday, August 16, 2013 9:05 AM
>>>>> To: The Pacemaker cluster resource manager
>>>>> Subject: [Pacemaker] Dual primary drbd + ocfs2: problems starting
>>>>> o2cb
>>>>>
>>>>> Hi all,
>>>>>
>>>>> i'm working on a two node pacemaker cluster with dual primary drbd
>>>>> and ocfs2.
>>>>>
>>>>> Dual pri drbd and ocfs2 WITHOUT pacemaker work fine (mounting,
>>>>> reading, writing, everything...).
>>>>>
>>>>> When i try to make this work in pacemaker, there seems to be a
>>>>> problem starting the o2cb resource.
>>>>>
>>>>> My (already simplified) configuration is:
>>>>> -----------------------------------------
>>>>> node poc1 \
>>>>> attributes standby="off"
>>>>> node poc2 \
>>>>> attributes standby="off"
>>>>> primitive res_dlm ocf:pacemaker:controld \
>>>>> op monitor interval="120"
>>>>> primitive res_drbd ocf:linbit:drbd \
>>>>> params drbd_resource="r0" \
>>>>> op stop interval="0" timeout="100" \
>>>>> op start interval="0" timeout="240" \
>>>>> op promote interval="0" timeout="90" \
>>>>> op demote interval="0" timeout="90" \
>>>>> op notify interval="0" timeout="90" \
>>>>> op monitor interval="40" role="Slave" timeout="20" \
>>>>> op monitor interval="20" role="Master" timeout="20"
>>>>> primitive res_o2cb ocf:pacemaker:o2cb \
>>>>> op monitor interval="60"
>>>>> ms ms_drbd res_drbd \
>>>>> meta notify="true" master-max="2" master-node-max="1" target-role="Started"
>>>>> property $id="cib-bootstrap-options" \
>>>>> no-quorum-policy="ignore" \
>>>>> stonith-enabled="false" \
>>>>> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>>>>> cluster-infrastructure="openais" \
>>>>> expected-quorum-votes="2" \
>>>>> last-lrm-refresh="1376574860"
>>>>
>>>> Looks like you are missing ordering and colocation and clone (even
>>>> group to make it a shorter config; group = order and colocation in one
>>>> statement) statements. The resources *must* start in a particular
>>>> order, they must run on the same node, and there must be an instance
>>>> of each resource on each node.
>>>>
>>>> More here for DRBD 8.4:
>>>> http://www.drbd.org/users-guide/s-ocfs2-pacemaker.html
>>>> Or DRBD 8.3:
>>>> http://www.drbd.org/users-guide-8.3/s-ocfs2-pacemaker.html
>>>>
>>>> Basically add:
>>>> group grp_dlm_o2cb res_dlm res_o2cb
>>>> clone cl_dlm_o2cb grp_dlm_o2cb meta interleave="true"
>>>> order ord_drbd_then_dlm_o2cb inf: ms_drbd:promote cl_dlm_o2cb:start
>>>> colocation col_dlm_o2cb_with_drbdmaster inf: cl_dlm_o2cb ms_drbd:Master
>>>>
>>>> HTH
>>>>
>>>> Jake
>>>
>>> Hello Jake,
>>>
>>> thanks for your reply. I already had res_dlm and res_o2cb grouped
>>> together and cloned like in your advice; indeed this was my initial
>>> configuration. But the problem showed up, so i tried to simplify the
>>> configuration to reduce possible error sources.
>>>
>>> But now it seems i found a solution; or at least a workaround: i just
>>> use the LSB resource agent lsb:o2cb. This one works! The resource starts
>>> without a problem on both nodes and as far as i can see right now
>>> everything is fine (tried with and without additional group and clone
>>> resource).
>>>
>>> Don't know if this will bring some drawbacks in the future; but for the
>>> moment my problem seems to be solved.
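>>>
>>> (In case it helps: the primitive i use for this is just something along
>>> these lines; the name and the monitor interval are only what i picked,
>>> nothing special:
>>>
>>> primitive res_o2cb_lsb lsb:o2cb \
>>>         op monitor interval="60"
>>> )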
>>
>> Not sure either - usually OCF resource agents are more robust than simple
>> LSB scripts. I would also verify that the o2cb init script is fully LSB
>> compliant, or your cluster will have issues.
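>>
>> (Roughly the LSB compliance checks from the Pacemaker docs, assuming the
>> init script lives at /etc/init.d/o2cb - the exit codes are what the
>> cluster relies on:
>>
>> /etc/init.d/o2cb start ; echo $?    # expect 0, and 0 again if already running
>> /etc/init.d/o2cb status ; echo $?   # expect 0 while running, 3 when stopped
>> /etc/init.d/o2cb stop ; echo $?     # expect 0, and 0 again if already stopped
>> )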
>>
>>>
>>> Currently it seems to me that there's a subtle problem with the
>>> ocf:pacemaker:o2cb resource agent; at least on my system.
>>
>> Maybe, maybe not - if you take a look at the o2cb resource agent, the error
>> message you were getting comes after it has tried to start
>> /usr/sbin/ocfs2_controld.pcmk for 10 seconds without success... I would
>> time how long the daemon takes to come up. It might be as simple as
>> allowing more time for startup of the daemon.
>> I've not set up ocfs2 in a while, but I believe you may be able to extend
>> that timeout via a parameter of the primitive without having to muck with
>> the actual resource agent.
>>
>> Jake
>
> Hello Jake,
>
> yes, i think that's possible, as you wrote. This is what ra meta ocf:pacemaker:o2cb says:
>
> daemon_timeout (string, [10]): Daemon Timeout
> Number of seconds to allow the control daemon to come up
>
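> So if i read that right, something like this should do it (untested; the
> value 30 is just an example):
>
> primitive res_o2cb ocf:pacemaker:o2cb \
>         params daemon_timeout="30" \
>         op monitor interval="60"
>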
> Thanks for the hint. I'll check that out when possible and see if it changes the behaviour. Currently i'm fine with lsb:o2cb...
>
> regards
>
>>
>>>
>>> Anyway, thanks a lot for your answer..!
>>> Best regards
>>> elmar
>>>
>>>
>>>>
>>>>> First error message in corosync.log as far as i can identify it:
>>>>> ----------------------------------------------------------------
>>>>> lrmd: [5547]: info: RA output: (res_dlm:probe:stderr) dlm_controld.pcmk: no process found
>>>>> [ other stuff ]
>>>>> lrmd: [5547]: info: RA output: (res_dlm:start:stderr) dlm_controld.pcmk: no process found
>>>>> [ other stuff ]
>>>>> lrmd: [5547]: info: RA output: (res_o2cb:start:stderr)
>>>>> 2013/08/16_13:25:20 ERROR: ocfs2_controld.pcmk did not come up
>>>>>
>>>>> (You can find the whole corosync logfile - from starting corosync on
>>>>> node 1 until after the resources have started - at:
>>>>> http://www.marschke.info/corosync_drei.log)
>>>>>
>>>>> syslog shows:
>>>>> -------------
>>>>> ocfs2_controld.pcmk[5774]: Unable to connect to CKPT: Object does not
>>>>> exist
>>>>>
>>>>>
>>>>> Output of crm_mon:
>>>>> ------------------
>>>>> ============
>>>>> Stack: openais
>>>>> Current DC: poc1 - partition WITHOUT quorum
>>>>> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
>>>>> 2 Nodes configured, 2 expected votes
>>>>> 4 Resources configured.
>>>>> ============
>>>>>
>>>>> Online: [ poc1 ]
>>>>> OFFLINE: [ poc2 ]
>>>>>
>>>>> Master/Slave Set: ms_drbd [res_drbd]
>>>>> Masters: [ poc1 ]
>>>>> Stopped: [ res_drbd:1 ]
>>>>> res_dlm (ocf::pacemaker:controld): Started poc1
>>>>>
>>>>> Migration summary:
>>>>> * Node poc1:
>>>>> res_o2cb: migration-threshold=1000000 fail-count=1000000
>>>>>
>>>>> Failed actions:
>>>>> res_o2cb_start_0 (node=poc1, call=6, rc=1, status=complete):
>>>>> unknown error
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> This is the situation after a reboot of node poc1. For simplification
>>>>> i left pacemaker / corosync unstarted on the second node, and i already
>>>>> removed a group and a clone resource that dlm and o2cb had been in
>>>>> (the errors were there as well).
>>>>>
>>>>> Is my configuration of the resource agents correct?
>>>>> I checked using "ra meta ...", and as far as i can tell everything is ok.
>>>>>
>>>>> Is some piece of software missing?
>>>>> dlm-pcmk is installed, ocfs2_controld.pcmk and dlm_controld.pcmk are
>>>>> available, and i even created additional links in /usr/sbin:
>>>>> root at poc1:~# which ocfs2_controld.pcmk
>>>>> /usr/sbin/ocfs2_controld.pcmk
>>>>> root at poc1:~# which dlm_controld.pcmk
>>>>> /usr/sbin/dlm_controld.pcmk
>>>>> root at poc1:~#
>>>>>
>>>>> I already googled but couldn't find anything useful. Thanks for any
>>>>> hints... :)
>>>>>
>>>>> kind regards
>>>>> elmar