[Pacemaker] Decreasing failover time when running DRBD+OCFS2+XEN in dual primary mode

Mon Jun 16 19:35:01 EDT 2014

On 13 Jun 2014, at 5:35 pm, kamal kishi <kamal.kishi at gmail.com> wrote:

> Hi Andrew,
> I checked the logs, and it looks to me like OCFS2 is taking time to recover. Can anyone please check my log and confirm whether I'm correct?

I see:

Jun 13 17:48:57 server2 crmd: [1840]: info: do_lrm_rsc_op: Performing key=6:111:0:68dbc62f-5255-4a32-915b-fbb3954c6092 op=resXen1_start_0 )
...
Jun 13 17:50:14 server2 lrmd: [1837]: info: RA output: (resXen1:start:stdout) Using config file "/home/cluster/xen/win7.cfg".#012Started domain xenwin7 (id=5)
Jun 13 17:50:16 server2 lrmd: [1837]: info: operation start[115] on resXen1 for client 1840: pid 20673 exited with return code 0

That looks like Xen taking about 79s to start (17:48:57 to 17:50:16), which is nearly the entire period covered by the logs.
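
For anyone wanting to repeat this measurement on their own syslog, here is a minimal Python sketch that computes the time between the "Performing ... op=resXen1_start_0" line and the matching "exited with return code" line. The helper names (syslog_time, elapsed_seconds) are hypothetical, chosen only for illustration, and the year is an assumption because syslog timestamps omit it. For the two lines quoted above it comes to 79 seconds.

```python
from datetime import datetime

# First and last log lines copied from the excerpt above
# (the intermediate "RA output" line is left out).
LOG_LINES = [
    "Jun 13 17:48:57 server2 crmd: [1840]: info: do_lrm_rsc_op: Performing "
    "key=6:111:0:68dbc62f-5255-4a32-915b-fbb3954c6092 op=resXen1_start_0 )",
    "Jun 13 17:50:16 server2 lrmd: [1837]: info: operation start[115] on resXen1 "
    "for client 1840: pid 20673 exited with return code 0",
]


def syslog_time(line, year=2014):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Syslog omits the year, so it is supplied by the caller (assumed 2014 here).
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def elapsed_seconds(start_line, end_line, year=2014):
    """Seconds between the timestamps of two syslog lines."""
    delta = syslog_time(end_line, year) - syslog_time(start_line, year)
    return delta.total_seconds()


start_duration = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
print(f"resXen1 start took {start_duration:.0f}s")  # prints: resXen1 start took 79s
```

Running the same calculation for each resource's start and stop operations shows which action dominates the failover time.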

> 
> And if OCFS2 is the reason for the failover delay, is there a way to reduce it?
> 
> Attached are my syslog and Pacemaker configuration.
> 
> Looking forward to a solution.
> 
> 
> On Fri, Jun 13, 2014 at 8:55 AM, kamal kishi <kamal.kishi at gmail.com> wrote:
> Fine, Andrew, I will check it out, but do the timeouts configured in Pacemaker affect this?
> Which timing settings does Pacemaker use to decide that the other node is actually down and that it should take over the resources?
> 
> And Alexis, I'm not facing any issues when putting a node into standby mode.
> I'm using DRBD 8.3.11 (apt-get install drbd8-utils=2:8.3.11-0ubuntu1)
> I had to force the download to that particular version because the current download/patch is not compatible with Pacemaker.
> Try installing 8.3.11 as well and check again. All the best.
> 
> 
> On Fri, Jun 13, 2014 at 5:22 AM, Andrew Beekhof <andrew at beekhof.net> wrote:
> 
> On 12 Jun 2014, at 9:15 pm, kamal kishi <kamal.kishi at gmail.com> wrote:
> 
> > Hi All,
> >
> > This might be a basic question, but I'm not sure what is taking so long during failover.
> > I hope someone can figure it out.
> 
> How about looking in the logs and seeing when the various stop/start actions occur and which ones take the longest?
> 
> >
> > Scenario -
> > Pacemaker running DRBD (dual-primary mode) + OCFS2 + Xen for a virtual Windows machine
> >
> > Pacemaker starts the resources in this order:
> > DRBD -> OCFS2 -> Xen
> > Suppose that on Server1, DRBD, OCFS2 (clone) and Xen are started,
> >
> > and on Server2, DRBD and OCFS2 (clone) are started.
> >
> > Now, if Server1 is powered off,
> >
> > the Xen resource that was running on Server1 should fail over to Server2.
> >
> > In my case, it's taking about 90 to 110 seconds to do this.
> >
> > Can anyone suggest ways to reduce it to 30 to 40 seconds?
> >
> > My Pacemaker configuration is:
> > crm configure
> > property no-quorum-policy=ignore
> > property stonith-enabled=false
> > property default-resource-stickiness=1000
> >
> > primitive resDRBDr1 ocf:linbit:drbd \
> >     params drbd_resource="r0" \
> >     op start interval="0" timeout="240s" \
> >     op stop interval="0" timeout="100s" \
> >     op monitor interval="20s" role="Master" timeout="240s" \
> >     op monitor interval="30s" role="Slave" timeout="240s" \
> >     meta migration-threshold="3" failure-timeout="60s"
> > primitive resOCFS2r1 ocf:heartbeat:Filesystem \
> >     params device="/dev/drbd/by-res/r0" directory="/cluster" fstype="ocfs2" \
> >     op monitor interval="10s" timeout="60s" \
> >     op start interval="0" timeout="90s" \
> >     op stop interval="0" timeout="60s" \
> >     meta migration-threshold="3" failure-timeout="60s"
> > primitive resXen1 ocf:heartbeat:Xen \
> >     params xmfile="/home/cluster/xen/win7.cfg" name="xenwin7" \
> >     op monitor interval="20s" timeout="60s" \
> >     op start interval="0" timeout="90s" \
> >     op stop interval="0" timeout="60s" \
> >     op migrate_from interval="0" timeout="120s" \
> >     op migrate_to interval="0" timeout="120s" \
> >     meta allow-migrate="true" target-role="started"
> >
> > ms msDRBDr1 resDRBDr1 \
> >     meta notify="true" master-max="2" interleave="true" target-role="Started"
> > clone cloOCFS2r1 resOCFS2r1 \
> >     meta interleave="true" ordered="true" target-role="Started"
> >
> > colocation colOCFS12-with-DRBDrMaster inf: cloOCFS2r1 msDRBDr1:Master
> > colocation colXen-with-OCFSr1 inf: resXen1 cloOCFS2r1
> > order ordDRBD-before-OCFSr1 inf: msDRBDr1:promote cloOCFS2r1:start
> > order ordOCFS2r1-before-Xen1 inf: cloOCFS2r1:start resXen1:start
> >
> > commit
> > bye
> >
> > --
> > Regards,
> > Kamal Kishore B V
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
> 
> 
> -- 
> Regards,
> Kamal Kishore B V
> <Pacemaker.txt><syslog.txt>
