[Pacemaker] (solved) Dual primary drbd resource not promoted on one host
Jürgen Herrmann
Juergen.Herrmann at XLhost.de
Wed Feb 13 13:58:24 UTC 2013
On 06.02.2013 00:00, Jake Smith wrote:
> ----- Original Message -----
>> From: "Jürgen Herrmann" <Juergen.Herrmann at XLhost.de>
>> To: "Jake Smith" <jsmith at argotec.com>, "The Pacemaker cluster
>> resource manager" <pacemaker at oss.clusterlabs.org>
>> Sent: Tuesday, February 5, 2013 4:00:48 PM
>> Subject: Re: [Pacemaker] Dual primary drbd resource not promoted on
>> one host
>>
>> On 05.02.2013 16:32, Jake Smith wrote:
>> > ----- Original Message -----
>> >> From: "Jürgen Herrmann" <Juergen.Herrmann at XLhost.de>
>> >> To: pacemaker at oss.clusterlabs.org
>> >> Sent: Tuesday, February 5, 2013 7:04:26 AM
>> >> Subject: [Pacemaker] Dual primary drbd resource not promoted on
>> >> one host
>> >>
>> >> Hi there!
>> >>
>> >> I have the following problem:
>> >>
>> >> I have a 2 node cluster with a dual primary drbd resource. On top
>> >> of it sits an OCFS2 file system. nodes: app1a, app1b
>> >>
>> >> Now today I had the following scenario (occurred several times
>> >> now):
>> >> - crm node standby app1a
>> >> - poweroff app1a for hdd replacement (hw raid controller)
>> >> - poweron app1a
>> >> - crm node online app1a
>> >>
>> >> All the other resources come back up as expected, except the
>> >> master/slave set for the dual primary drbd.
>> >>
>> >> here's the relevant portion of my cluster config:
>> >>
>> >> node app1a.xlhost.de \
>> >>     attributes standby="off"
>> >> node app1b.xlhost.de \
>> >>     attributes standby="off"
>> >> primitive resDLM ocf:pacemaker:controld \
>> >>     op start interval="0" timeout="90s" \
>> >>     op stop interval="0" timeout="100s" \
>> >>     op monitor interval="120s"
>> >> primitive resDRBD0 ocf:linbit:drbd \
>> >>     op monitor interval="23" role="Slave" timeout="30" \
>> >>     op monitor interval="13" role="Master" timeout="20" \
>> >>     op start interval="0" timeout="240s" \
>> >>     op promote interval="0" timeout="240s" \
>> >>     op demote interval="0" timeout="100s" \
>> >>     op stop interval="0" timeout="100s" \
>> >>     params drbd_resource="drbd0"
>> >> primitive resFSDRBD0 ocf:heartbeat:Filesystem \
>> >>     params device="/dev/drbd0" directory="/mnt/drbd0" \
>> >>     fstype="ocfs2" options="noatime,intr,nodiratime,heartbeat=none" \
>> >>     op monitor interval="120s" timeout="50s" \
>> >>     op start interval="0" timeout="70s" \
>> >>     op stop interval="0" timeout="70s"
>> >> primitive resO2CB ocf:pacemaker:o2cb \
>> >>     op start interval="0" timeout="90s" \
>> >>     op stop interval="0" timeout="100s" \
>> >>     op monitor interval="120s"
>> >> ms msDRBD0 resDRBD0 \
>> >>     meta master-max="2" master-node-max="1" clone-max="2" \
>> >>     clone-node-max="1" notify="true" target-role="Master"
>> >> clone cloneDLM resDLM \
>> >>     meta globally-unique="false" interleave="true" \
>> >>     target-role="Started"
>> >> clone cloneFSDRBD0 resFSDRBD0 \
>> >>     meta interleave="true" globally-unique="false" \
>> >>     target-role="Started"
>> >> clone cloneO2CB resO2CB \
>> >>     meta globally-unique="false" interleave="true" \
>> >>     target-role="Started"
>> >> colocation colFSDRBD0_DRBD0 inf: cloneFSDRBD0 msDRBD0:Master
>> >
>> > ^^^ This colocation should be cloneDLM on msDRBD0.
>> >
>> >> colocation colFSDRBD0_O2CB inf: cloneFSDRBD0 cloneO2CB
>> >> colocation colO2CB_DLM inf: cloneO2CB cloneDLM
>> >> order ordDLM_FSDRBD0 inf: cloneDLM cloneFSDRBD0
>> >
>> > ^^^ This order statement is not needed.
>> >
>> >> order ordDLM_O2CB inf: cloneDLM cloneO2CB
>> >> order ordDRBD0_FSDRBD0 inf: msDRBD0:promote cloneFSDRBD0
>> >
>> > ^^^ This order should be msDRBD0:promote then cloneDLM:start
>> >
>> > If you explicitly define the action in an order statement for the
>> > resource then the same action is implied for the rest of the
>> > resources. So your statement is going to try to promote
>> > cloneFSDRBD0.
>> > You should define both actions explicitly like this:
>> >
>> > order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
>> >
>> >> order ordO2CB_FSDRBD0 inf: cloneO2CB cloneFSDRBD0
>> >>
>> >
>> >
>> >> If I take down both nodes and fire them up again, everything goes
>> >> back to normal and msDRBD0 is promoted to master on both nodes.
>> >>
>> >> I suspect this has something to do with ordering or colocation
>> >> constraints, but I'm not sure. I've stared at this problem dozens
>> >> of times now, and a vast amount of googling did not turn up my
>> >> specific problem either.
>> >
>> > I'm pretty sure you are correct. I haven't used/tested OCFS on
>> > Pacemaker in a while, but I believe this is the correct
>> > ordering/colocation you're looking for (same as my notes above):
>> >
>> > Order - DRBD:promote then DLM:start then O2CB:start then FS:start
>> > Colocation - FS on O2CB on DLM on DRBD:master
>> >
>>
>> Hi Jake!
>>
>> Thanks very much for your comments!
>>
>> To sum it up, I rewrote all six order/colocation statements here:
>>
>> colocation colDLM_DRBD0 inf: cloneDLM msDRBD0:Master
>> colocation colO2CB_DLM inf: cloneO2CB cloneDLM
>> colocation colFSDRBD0_O2CB inf: cloneFSDRBD0 cloneO2CB
>>
>> order ordDRBD0_DLM inf: msDRBD0:promote cloneDLM:start
>> order ordDLM_O2CB inf: cloneDLM cloneO2CB
>> order ordO2CB_FSDRBD0 inf: cloneO2CB cloneFSDRBD0
>>
>> I will try this during one of the upcoming nights and report back.
>> Maybe in the meantime you could have a look at the statements again
>> to double-check? Thanks in advance.
>
> Looks good to me (assuming I recall correctly that dlm needs to start
> before o2cb).
>
>>
>> best regards,
>> Jürgen Herrmann
>>
I can confirm that this immediately solved my problem.
Thanks again!!!
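
A minimal Python sketch of the rule Jake described, for illustration only. It is not Pacemaker code, and the function names expand_order and start_sequence are made up here. It assumes two-resource order statements in crm shell syntax, where a missing action on the first resource means "start" and a missing action on the second resource inherits the action of the first. Under that assumption it shows why the old statement implied a promote of cloneFSDRBD0, and which sequence the three rewritten order statements describe.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def expand_order(spec):
    """Expand 'first[:action] then[:action]' into two (resource, action) pairs.

    Assumed model: the first action defaults to 'start', and the second
    action defaults to whatever the first action is.
    """
    first, then = spec.split()
    first_name, _, first_action = first.partition(":")
    then_name, _, then_action = then.partition(":")
    first_action = first_action or "start"
    then_action = then_action or first_action
    return [(first_name, first_action), (then_name, then_action)]


def start_sequence(orders):
    """Return the (resource, action) steps in an order that satisfies all statements."""
    sorter = TopologicalSorter()
    for spec in orders:
        first, then = expand_order(spec)
        sorter.add(then, first)  # 'then' may only run after 'first'
    return list(sorter.static_order())


# The statement from the original configuration.
OLD = "msDRBD0:promote cloneFSDRBD0"

# The three order statements from the rewritten configuration.
FIXED = [
    "msDRBD0:promote cloneDLM:start",
    "cloneDLM cloneO2CB",
    "cloneO2CB cloneFSDRBD0",
]

if __name__ == "__main__":
    print(expand_order(OLD))
    for resource, action in start_sequence(FIXED):
        print(resource, action)
```

In this model the old statement expands to a promote on both msDRBD0 and cloneFSDRBD0, while the rewritten statements form a single chain: promote msDRBD0, then start cloneDLM, cloneO2CB and cloneFSDRBD0 in that order.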
Jürgen Herrmann
--
>> XLhost.de ® - Web hosting from supersmall to eXtra Large <<
XLhost.de GmbH
Jürgen Herrmann, Managing Director
Boelckestrasse 21, 93051 Regensburg, Germany
Managing Director: Jürgen Herrmann
Registered under: HRB9918
VAT identification number: DE245931218
Phone: +49 (0)800 XLHOSTDE [0800 95467833]
Fax: +49 (0)800 95467830
Web: http://www.XLhost.de