[Pacemaker] Strange behaviour of dual master DRBD

Fri Sep 11 07:29:57 UTC 2009

On Thursday, 10 September 2009 19:25:53, Lars Ellenberg wrote:
> On Thu, Sep 10, 2009 at 05:10:39PM +0200, Michael Schwartzkopff wrote:
> > Hi,
> >
> > I configured a dual-master DRBD 8.3.2. When the nodes start there is no
> > problem. Both DRBD instances become master. But when I set one node to
> > standby and wake it up again, the DRBD on that node stays secondary and
> > does not become master.
> >
> > My config:
> > primitive resDRBD ocf:linbit:drbd params drbd_resource="r0"
> > ms msDRBD resDRBD meta notify="true" master-max="2"
> >
> > No further constraints.
> >
> > When the second node is online again ptest -sL shows:
> > (...)
> > resDRBD:0 promotion score on suse2: 50
> > resDRBD:1 promotion score on suse1: -1
> >
> > Since the "-1" prevents the resource from being promoted, I understand
> > the behaviour of the cluster, but why isn't the resource being allowed
> > to become master on that node?
> >
> > Thanks for any enlightening answers.
>
> Most likely it prevents you from shooting yourself in the foot ;)
>
> Look at /proc/drbd and the kernel logs (apart from the ha.log, of
> course) on the DRBD nodes to find out more.
> I bet you manoeuvred yourself into diverging data sets (aka DRBD "split
> brain").
>
> If it turns out to be a drbd.ocf bug,
> let me know.

Hi,

No, it is not a DRBD-related problem. The DRBD status is Connected, and after a
cleanup the resource gets promoted. I think it is a Pacemaker-related problem,
since the resource gets a promotion score of -1, and that should not happen.

Configuration:
node suse1 \
	attributes $id="nodes-suse1"
node suse2 \
	attributes $id="nodes-suse2"
primitive resDRBD ocf:linbit:drbd \
	params drbd_resource="r0"
ms msDRBD resDRBD \
	meta notify="true" master-max="2"
property $id="cib-bootstrap-options" \
	dc-version="1.0.5-13f3497959e894e57b8cb24f59c8683346b216e3" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	last-lrm-refresh="1252595100" \
	stonith-enabled="false"

 # ptest -sL
Allocation scores:
clone_color: msDRBD allocation score on suse1: 0
clone_color: msDRBD allocation score on suse2: 0
clone_color: resDRBD:0 allocation score on suse1: 51
clone_color: resDRBD:0 allocation score on suse2: 0
clone_color: resDRBD:1 allocation score on suse1: 0
clone_color: resDRBD:1 allocation score on suse2: 1
native_color: resDRBD:0 allocation score on suse1: 51
native_color: resDRBD:0 allocation score on suse2: 0
native_color: resDRBD:1 allocation score on suse1: -1000000
native_color: resDRBD:1 allocation score on suse2: 1
resDRBD:0 promotion score on suse1: 50
resDRBD:1 promotion score on suse2: -1
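
To pick the promotion scores out of `ptest -sL` output like the above, a small Python sketch might look like this. The helper names and the regular expression are assumptions made for illustration; they are not part of Pacemaker, and the output format may differ between versions.

```python
import re

# Hypothetical pattern for lines such as
#   "resDRBD:1 promotion score on suse2: -1"
# Allocation score lines start with "clone_color:" or "native_color:"
# and therefore do not match.
PROMOTION_RE = re.compile(
    r"^(?P<rsc>\S+) promotion score on (?P<node>\S+): "
    r"(?P<score>[-+]?(?:\d+|INFINITY))$"
)


def promotion_scores(ptest_output):
    """Return (resource, node, score) tuples found in ptest -sL text."""
    scores = []
    for line in ptest_output.splitlines():
        m = PROMOTION_RE.match(line.strip())
        if m:
            scores.append((m.group("rsc"), m.group("node"), m.group("score")))
    return scores


def blocked(scores):
    """Return the instances whose promotion score is negative."""
    return [entry for entry in scores if entry[2].startswith("-")]


# Excerpt of the output shown above.
SAMPLE = """\
native_color: resDRBD:1 allocation score on suse1: -1000000
native_color: resDRBD:1 allocation score on suse2: 1
resDRBD:0 promotion score on suse1: 50
resDRBD:1 promotion score on suse2: -1
"""

if __name__ == "__main__":
    for rsc, node, score in blocked(promotion_scores(SAMPLE)):
        print(f"{rsc} on {node}: promotion score {score}")
```

Scores are kept as strings so that a value such as -INFINITY, if a given version prints one, is reported unchanged.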

Complete log at:
http://pastebin.com/m781bb664

What confuses me in the log is:
(...)
Sep 11 09:20:45 suse1 pengine: [3510]: info: unpack_rsc_op: 
resDRBD:1_monitor_0 on suse2 returned 8 (master) instead of the expected 
value: 7 (not running)

Why does the cluster expect the resource not to be running? This is the
surviving node, where the resource was running all the time.
(...)
Sep 11 09:20:45 suse1 pengine: [3510]: info: unpack_rsc_op: 
resDRBD:0_monitor_0 on suse1 returned 0 (ok) instead of the expected value: 7 
(not running)

Perhaps this is an error in the RA? The resource is NOT running on the node
that just came online. So why does the RA report "online" instead of "not
running"?
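
For reading log lines like the two above, the numeric return values correspond to the standard OCF resource agent exit codes (0, 7 and 8 here). As far as I understand, an operation name ending in `_monitor_0` is a monitor with interval 0, which Pacemaker uses as a one-off probe. The following Python sketch extracts the operation, node, and the returned and expected codes; the function name and regular expression are hypothetical, and the log format is assumed to be the one quoted here.

```python
import re

# Standard OCF exit codes that appear in the log excerpts above.
OCF_CODES = {
    0: "OCF_SUCCESS (running; for a master/slave resource: running as slave)",
    7: "OCF_NOT_RUNNING (cleanly stopped)",
    8: "OCF_RUNNING_MASTER (running in the master role)",
}

# Hypothetical pattern for the pengine "unpack_rsc_op" messages quoted above.
PROBE_RE = re.compile(
    r"unpack_rsc_op:\s+(?P<op>\S+) on (?P<node>\S+) returned (?P<rc>\d+) "
    r"\([^)]*\) instead of the expected value: (?P<expected>\d+)"
)


def parse_probes(log_text):
    """Return (operation, node, returned_rc, expected_rc) tuples."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    return [
        (m.group("op"), m.group("node"), int(m.group("rc")), int(m.group("expected")))
        for m in PROBE_RE.finditer(flat)
    ]


# The two log entries from above, wrapped as they appear in the mail.
SAMPLE = """\
Sep 11 09:20:45 suse1 pengine: [3510]: info: unpack_rsc_op:
resDRBD:1_monitor_0 on suse2 returned 8 (master) instead of the expected
value: 7 (not running)
Sep 11 09:20:45 suse1 pengine: [3510]: info: unpack_rsc_op:
resDRBD:0_monitor_0 on suse1 returned 0 (ok) instead of the expected value: 7
(not running)
"""

if __name__ == "__main__":
    for op, node, rc, expected in parse_probes(SAMPLE):
        print(f"{op} on {node}: got {OCF_CODES.get(rc, rc)}, "
              f"expected {OCF_CODES.get(expected, expected)}")
```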

(...)

Greetings,
-- 
Dr. Michael Schwartzkopff
MultiNET Services GmbH
Address: Bretonischer Ring 7; 85630 Grasbrunn; Germany
Tel: +49 - 89 - 45 69 11 0
Fax: +49 - 89 - 45 69 11 21
mob: +49 - 174 - 343 28 75

mail: misch at multinet.de
web: www.multinet.de

Registered office: 85630 Grasbrunn
Court of registration: Amtsgericht München HRB 114375
Managing directors: Günter Jurgeneit, Hubert Martens

---

PGP Fingerprint: F919 3919 FF12 ED5A 2801 DEA6 AA77 57A4 EDD8 979B
Skype: misch42