[Pacemaker] pacemaker fails to start drbd using ocf:linbit:drbd

Thu Jul 1 17:05:09 UTC 2010

Hi Martin,

I don't have drbd set to start up automatically in any runlevel and I did a
"rcdrbd stop" on both nodes before starting openais. I just repeated it one
more time, checking lsmod first to confirm drbd is not loaded and the result
is the same. One piece of extra information is that even though drbd fails
to start up correctly, there is at least partial success:

storm:~ # rcdrbd status
drbd driver loaded OK; device status:
version: 8.3.7 (api:88/proto:86-91)
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
m:res  cs         ro                   ds                 p  mounted  fstype
0:r0   Connected  Secondary/Secondary  UpToDate/UpToDate  C

This status is the same on both nodes. It looks like all that's missing is
to promote the correct node.
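
For anyone wanting to script this check, here is a minimal sketch that reads
"rcdrbd status" text on stdin and reports whether the resource is connected
but still unpromoted. The helper name needs_promotion is hypothetical, and the
column layout (m:res, cs, ro, ds, ...) is assumed to match the 8.3.7 output
shown above.

```shell
#!/bin/sh
# Sketch: decide from "rcdrbd status" text whether a resource still needs
# promoting. needs_promotion is a hypothetical helper, not part of drbd.
needs_promotion() {
    # $1 = resource name (e.g. r0); the status text is read from stdin.
    # Device line columns: m:res  cs  ro  ds  p  mounted  fstype
    awk -v res="$1" '
        { split($1, id, ":") }
        id[2] == res {
            found = 1
            if ($2 == "Connected" && $3 == "Secondary/Secondary")
                print "promote"   # both sides connected, nobody is Primary
            else
                print "inspect"   # e.g. StandAlone, or a Primary already exists
        }
        END { if (!found) print "missing" }
    '
}

# Offline demonstration using the device line quoted above:
printf '%s\n' '0:r0   Connected  Secondary/Secondary  UpToDate/UpToDate  C' |
    needs_promotion r0
```

On a live node the input would come from "rcdrbd status | needs_promotion r0".
Done by hand, the promotion itself is "drbdadm primary r0"; under pacemaker it
should normally be left to the master/slave resource.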

Starting up drbd manually takes about 15s. 

Starting openais on storm alone brought up only my stonith-fencing resource
and none of the others. I'm going to simplify my setup and get rid of
everything but the core resources.

Is pacemaker 1.1.2 (the version included in SLES 11 SP1 HA) actually stable?
The highest pre-built binary version available from clusterlabs.org seems to
be 1.0.9.

Thanks,
Bart

-----Original Message-----
From: martin.braun at icw.de [mailto:martin.braun at icw.de] 
Sent: Thursday, July 01, 2010 11:03
To: bart at atipa.com; The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] pacemaker fails to start drbd using ocf:linbit:drbd

Hi Bart,
Just some more thoughts:

Are you sure that drbd was really stopped? 
Does this error also happen after a clean restart (without drbd starting
in any runlevel), i.e. when "lsmod | grep drbd" returns no results?
How long does it take if you set up drbd (attach, syncer, connect, primary)
manually?
What happens when you start openais on only one node?
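
The "is drbd really stopped" check can be made a little stricter than a plain
grep. The sketch below is only an illustration: drbd_module_state is a
hypothetical helper name, and it reads lsmod-style text on stdin so it can be
tried against a saved listing as well as on a node.

```shell
#!/bin/sh
# Sketch: report whether the drbd kernel module appears in lsmod output.
# drbd_module_state is a hypothetical helper; it reads lsmod text on stdin.
drbd_module_state() {
    # Compare the module-name column only, so a line that merely mentions
    # drbd in its "Used by" column does not count as a match.
    if awk '$1 == "drbd" { hit = 1 } END { exit !hit }'; then
        echo "loaded"
    else
        echo "not loaded"
    fi
}

# Live check (only meaningful on a cluster node):
#   lsmod | drbd_module_state
# Offline demonstration with made-up lsmod lines:
printf 'Module Size Used by\next3 150000 1\n' | drbd_module_state
```

For the timing question, wrapping the manual bring-up in "time" (for example
"time drbdadm up r0") gives a figure to compare against the start timeout in
the primitive definition.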

The syncer rate seems a bit high to me
(http://www.drbd.org/users-guide/s-configure-syncer-rate.html#eq-syncer-rate-example1),
but that should not be the problem.
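
As a rough illustration of the rule of thumb in that section of the DRBD
user's guide (about 30% of the available replication bandwidth, taking the
slower of network and disk as the bottleneck), here is a small sketch. The
helper name suggested_syncer_rate and the throughput figures are assumptions
for the example, not measurements from storm or storm-b.

```shell
#!/bin/sh
# Sketch of the ~30% rule of thumb for the DRBD syncer rate.
# suggested_syncer_rate is a hypothetical helper. Both arguments are
# sustained throughputs in MB/s; the smaller one is the bottleneck.
suggested_syncer_rate() {
    net=$1
    disk=$2
    if [ "$net" -lt "$disk" ]; then
        bottleneck=$net
    else
        bottleneck=$disk
    fi
    # Integer arithmetic: 30% of the bottleneck, rounded down.
    echo "$(( bottleneck * 30 / 100 ))M"
}

# Assumed figures: ~110 MB/s for a Gigabit Ethernet replication link and
# 180 MB/s for the disk subsystem.
suggested_syncer_rate 110 180
```

With those assumed figures the result is 33M, well below the 120M configured
in r0.res.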

HTH,
Martin

"Bart Willems" <bart at atipa.com> wrote on 01.07.2010 16:42:26:

> Hi Martin,
> 
> No luck, I'm afraid. I first added a start-delay to the monitor operations,
> and when that didn't work I also added a start-delay to the start operation:
> 
> primitive drbd-storage ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="10" role="Master" timeout="60" start-delay="1m" \
>         op start interval="0" timeout="240s" start-delay="1m" \
>         op stop interval="0" timeout="100s" \
>         op monitor interval="20" role="Slave" timeout="60" start-delay="1m"
> 
> Thanks,
> Bart
> 
> -----Original Message-----
> From: martin.braun at icw.de [mailto:martin.braun at icw.de] 
> Sent: Thursday, July 01, 2010 3:37
> To: bart at atipa.com; The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] pacemaker fails to start drbd using ocf:linbit:drbd
> 
> Hi Bart,
> 
> my guess is that you forgot the start-delay attribute for the monitor
> operations; that's why you see the time-out error message.
> 
> Here is an example:
> 
> 
>         op monitor interval="20" role="Slave" timeout="20" start-delay="1m" \
>         op monitor interval="10" role="Master" timeout="20" start-delay="1m" \
>         op start interval="0" timeout="240s" \
>         op stop interval="0" timeout="100s" \
>         params drbd_resource="r0" drbdconf="/usr/local/etc/drbd.conf"
> 
> HTH,
> Martin
> 
> 
> 
> "Bart Willems" <bart at atipa.com> wrote on 30.06.2010 21:57:35:
> 
> > Hi All,
> > 
> > I am setting up SLES11 SP1 HA on 2 nodes and have configured a master/slave
> > drbd resource. I can start drbd, promote/demote hosts, and mount/use the
> > file system from the command line, but pacemaker fails to properly start up
> > the drbd service. The 2 nodes are named storm (master) and storm-b (slave).
> > 
> > Details of my setup are:
> > 
> > **********
> > * storm: *
> > **********
> > 
> > eth0: 172.16.0.1/16 (static)
> > eth1: 172.20.168.239 (dhcp)
> > ipmi: 172.16.1.1/16 (static)
> > 
> > ************
> > * storm-b: *
> > ************
> > 
> > eth0: 172.16.0.2/16 (static)
> > eth1: 172.20.168.114 (dhcp)
> > ipmi: 172.16.1.2/16 (static)
> > 
> > ***********************
> > * drbd configuration: *
> > ***********************
> > 
> > storm:~ # cat /etc/drbd.conf 
> > #
> > # please have a a look at the example configuration file in
> > # /usr/share/doc/packages/drbd-utils/drbd.conf
> > #
> > # Note that you can use the YaST2 drbd module to configure this
> > # service!
> > #
> > include "drbd.d/global_common.conf";
> > include "drbd.d/*.res";
> > 
> > storm:~ # cat /etc/drbd.d/r0.res 
> > resource r0 {
> >         device /dev/drbd_r0 minor 0;
> >         meta-disk internal;
> >         on storm {
> >                 disk /dev/sdc1;
> >                 address 172.16.0.1:7811;
> >         }
> >         on storm-b {
> >                 disk /dev/sde1;
> >                 address 172.16.0.2:7811;
> >         }
> >         syncer  {
> >                 rate    120M;
> >         }
> > }
> > 
> > ***********************************
> > * Output of "crm configure show": *
> > ***********************************
> > 
> > storm:~ # crm configure show
> > node storm
> > node storm-b
> > primitive backupExec-ip ocf:heartbeat:IPaddr \
> >         params ip="172.16.0.10" cidr_netmask="16" nic="eth0" \
> >         op monitor interval="30s"
> > primitive drbd-storage ocf:linbit:drbd \
> >         params drbd_resource="r0" \
> >         op monitor interval="60" role="Master" timeout="60" \
> >         op start interval="0" timeout="240" \
> >         op stop interval="0" timeout="100" \
> >         op monitor interval="61" role="Slave" timeout="60"
> > primitive drbd-storage-fs ocf:heartbeat:Filesystem \
> >         params device="/dev/drbd0" directory="/disk1" fstype="ext3"
> > primitive public-ip ocf:heartbeat:IPaddr \
> >         meta target-role="started" \
> >         operations $id="public-ip-operations" \
> >         op monitor interval="30s" \
> >         params ip="143.219.41.20" cidr_netmask="24" nic="eth1"
> > primitive storm-fencing stonith:external/ipmi \
> >         meta target-role="started" \
> >         operations $id="storm-fencing-operations" \
> >         op monitor interval="60" timeout="20" \
> >         op start interval="0" timeout="20" \
> >         params hostname="storm" ipaddr="172.16.1.1" userid="****" passwd="****" interface="lan"
> > ms drbd-storage-masterslave drbd-storage \
> >         meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false" target-role="started"
> > location drbd-storage-master-location drbd-storage-masterslave +inf: storm
> > location storm-fencing-location storm-fencing +inf: storm-b
> > colocation drbd-storage-fs-together inf: drbd-storage-fs drbd-storage-masterslave:Master
> > order drbd-storage-fs-startup-order inf: drbd-storage-masterslave:promote drbd-storage-fs:start
> > property $id="cib-bootstrap-options" \
> >         dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
> >         cluster-infrastructure="openais" \
> >         expected-quorum-votes="2" \
> >         no-quorum-policy="ignore" \
> >         last-lrm-refresh="1277922623" \
> >         node-health-strategy="only-green" \
> >         stonith-enabled="true" \
> >         stonith-action="poweroff"
> > op_defaults $id="op_defaults-options" \
> >         record-pending="false"
> > 
> > ************************************
> > * Output of "crm_mon -o" on storm: *
> > ************************************
> > 
> > storm:~ # crm_mon -o 
> > Attempting connection to the cluster...
> > ============
> > Last updated: Wed Jun 30 15:25:15 2010
> > Stack: openais
> > Current DC: storm - partition with quorum
> > Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> > 2 Nodes configured, 2 expected votes
> > 5 Resources configured.
> > ============
> > 
> > Online: [ storm storm-b ]
> > 
> > storm-fencing   (stonith:external/ipmi):        Started storm-b
> > backupExec-ip   (ocf::heartbeat:IPaddr):        Started storm
> > public-ip       (ocf::heartbeat:IPaddr):        Started storm
> > 
> > Operations:
> > * Node storm: 
> >    public-ip: migration-threshold=1000000
> >     + (8) start: rc=0 (ok)
> >     + (11) monitor: interval=30000ms rc=0 (ok)
> >    backupExec-ip: migration-threshold=1000000
> >     + (7) start: rc=0 (ok)
> >     + (10) monitor: interval=30000ms rc=0 (ok)
> >    drbd-storage:0: migration-threshold=1000000 fail-count=1000000
> >     + (9) start: rc=-2 (unknown exec error)
> >     + (14) stop: rc=0 (ok)
> > * Node storm-b: 
> >    storm-fencing: migration-threshold=1000000
> >     + (7) start: rc=0 (ok)
> >     + (9) monitor: interval=6)
> > 
> > ************************************** 
> > * Output of "crm_mon -o" on storm-b: *
> > **************************************
> > 
> > storm-b:~ # crm_mon -o
> > Attempting connection to the cluster...
> > ============
> > Last updated: Wed Jun 30 15:25:25 2010
> > Stack: openais
> > Current DC: storm - partition with quorum
> > Version: 1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5
> > 2 Nodes configured, 2 expected votes
> > 5 Resources configured.
> > ============
> > 
> > Online: [ storm storm-b ]
> > 
> > storm-fencing   (stonith:external/ipmi):        Started storm-b
> > backupExec-ip   (ocf::heartbeat:IPaddr):        Started storm
> > public-ip       (ocf::heartbeat:IPaddr):        Started storm
> > 
> > Operations:
> > * Node storm: 
> >    public-ip: migration-threshold=1000000
> >     + (8) start: rc=0 (ok)
> >     + (11) monitor: interval=30000ms rc=0 (ok)
> >    backupExec-ip: migration-threshold=1000000
> >     + (7) start: rc=0 (ok)
> >     + (10) monitor: interval=30000ms rc=0 (ok)
> >    drbd-storage:0: migration-threshold=1000000 fail-count=1000000
> >     + (9) start: rc=-2 (unknown exec error)
> >     + (14) stop: rc=0 (ok)
> > * Node storm-b: 
> >    storm-fencing: migration-threshold=1000000
> >     + (7) start: rc=0 (ok)
> >     + (9) monitor: interval=60000ms rc=0 (ok)
> >    drbd-storage:1: migration-threshold=1000000 fail-count=1000000
> >     + (8) start: rc=-2 (unknown exec error)
> >     + (12) stop: rc=0 (ok)
> > 
> > Failed actions:
> >     drbd-storage:0_start_0 (node=storm, call=9, rc=-2, status=Timed Out): unknown exec error
> >     drbd-storage:1_start_0 (node=storm-b, call=8, rc=-2, status=Timed Out): unknown exec error
> > 
> > 
> > ********************************************************
> > * Output of "rcdrbd status" on both storm and storm-b: *
> > ********************************************************
> > 
> > # rcdrbd status
> > drbd driver loaded OK; device status:
> > version: 8.3.7 (api:88/proto:86-91)
> > GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
> > m:res  cs          ro                 ds                 p mounted  fstype
> > 0:r0   StandAlone  Secondary/Unknown  UpToDate/DUnknown  r----
> > 
> > *********************************
> > * Part of the drbd log entries: *
> > *********************************
> > 
> > Jun 30 15:38:10 storm kernel: [ 3730.185457] drbd: initialized. Version: 8.3.7 (api:88/proto:86-91)
> > Jun 30 15:38:10 storm kernel: [ 3730.185459] drbd: GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by phil at fat-tyre, 2010-01-13 17:17:27
> > Jun 30 15:38:10 storm kernel: [ 3730.185460] drbd: registered as block device major 147
> > Jun 30 15:38:10 storm kernel: [ 3730.185462] drbd: minor_table @ 0xffff88035fc0ca80
> > Jun 30 15:38:10 storm kernel: [ 3730.188253] block drbd0: Starting worker thread (from cqueue [9510])
> > Jun 30 15:38:10 storm kernel: [ 3730.188312] block drbd0: disk( Diskless -> Attaching )
> > Jun 30 15:38:10 storm kernel: [ 3730.188866] block drbd0: Found 4 transactions (4 active extents) in activity log.
> > Jun 30 15:38:10 storm kernel: [ 3730.188868] block drbd0: Method to ensure write ordering: barrier
> > Jun 30 15:38:10 storm kernel: [ 3730.188870] block drbd0: max_segment_size ( = BIO size ) = 32768
> > Jun 30 15:38:10 storm kernel: [ 3730.188872] block drbd0: drbd_bm_resize called with capacity == 9765216
> > Jun 30 15:38:10 storm kernel: [ 3730.188907] block drbd0: resync bitmap: bits=1220652 words=19073
> > Jun 30 15:38:10 storm kernel: [ 3730.188910] block drbd0: size = 4768 MB (4882608 KB)
> > Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stdout)
> > Jun 30 15:38:10 storm kernel: [ 3730.189263] block drbd0: recounting of set bits took additional 0 jiffies
> > Jun 30 15:38:10 storm kernel: [ 3730.189265] block drbd0: 4 KB (1 bits) marked out-of-sync by on disk bit-map.
> > Jun 30 15:38:10 storm kernel: [ 3730.189269] block drbd0: disk( Attaching -> UpToDate )
> > Jun 30 15:38:10 storm kernel: [ 3730.191735] block drbd0: conn( StandAlone -> Unconnected )
> > Jun 30 15:38:10 storm kernel: [ 3730.191748] block drbd0: Starting receiver thread (from drbd0_worker [15487])
> > Jun 30 15:38:10 storm kernel: [ 3730.191780] block drbd0: receiver (re)started
> > Jun 30 15:38:10 storm kernel: [ 3730.191785] block drbd0: conn( Unconnected -> WFConnection )
> > Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) 0: Failure: (124) Device is attached to a disk (use detach first)
> > Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) Command 'drbdsetup 0 disk /dev/sdc1 /dev/sdc1 internal
> > Jun 30 15:38:10 storm lrmd: [15233]: info: RA output: (drbd-storage:0:start:stderr) --set-defaults --create-device' terminated with exit code 10
> > Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Called drbdadm -c /etc/drbd.conf --peer storm-b up r0
> > Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Exit code 1
> > Jun 30 15:38:10 storm drbd[15341]: ERROR: r0: Command output: 
> > 
> > I made sure rcdrbd was stopped before starting rcopenais, so the failure
> > related to the device being attached arises during openais startup.
> > 
> > *************************
> > * Result of ocf-tester: *
> > *************************
> > 
> > storm:~ # ocf-tester -n drbd-storage -o drbd_resource="r0" /usr/lib/ocf/resource.d/linbit/drbd
> > Beginning tests for /usr/lib/ocf/resource.d/linbit/drbd...
> > * rc=6: Validation failed.  Did you supply enough options with -o ?
> > Aborting tests
> > 
> > The only required parameter according to "crm ra info ocf:linbit:drbd" is
> > drbd_resource, so there shouldn't be any additional options required to make
> > ocf-tester work.
> > 
> > 
> > Any suggestions for debugging and solutions would be most appreciated.
> > 
> > Thanks,
> > Bart
> > 
> > 
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> > 
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> 
> 
> InterComponentWare AG:
> Vorstand: Peter Kirschbauer (Vors.), Jörg Stadler / Aufsichtsratsvors.:
> Prof. Dr. Christof Hettich
> Firmensitz: 69190 Walldorf, Altrottstraße 31 / AG Mannheim HRB 351761 /
> USt.-IdNr.: DE 198388516