[Pacemaker] Problem with one drbd dual primary ressource
georg at riseup.net
Sat Aug 10 19:44:42 UTC 2013
Hi all,
(If this is not the correct mailing list for this question, I
apologize... maybe you could then give me a hint where it might fit
better.)
I've set up a test cluster with Debian Wheezy, pacemaker/corosync, DRBD
dual primary, and Xen on top of this. I know I should configure STONITH;
I just haven't done so yet because it's a test environment.

I've got several DRBD resources and Xen domUs. For most of them live
migration is working like a charm, but I've got problems with one DRBD
resource and the state change after a Xen domU migration (at least I
guess that's the problem). I checked my configuration for differences
between the resources, but I didn't find any, and as far as I remember,
they are all set up identically.
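In case it helps, this is roughly what I keep running on both nodes
while a migration is in progress, just to watch the state change happen
(standard DRBD tooling, nothing special):

    # kernel view of all DRBD resources, refreshed every second
    watch -n1 cat /proc/drbd

    # connection and disk state of the problematic resource
    drbdadm cstate nfs    # should stay Connected
    drbdadm dstate nfs    # should stay UpToDate/UpToDate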
The drbd.conf snippet for this resource looks like:
resource nfs {
    flexible-meta-disk  internal;
    device              /dev/drbd4;
    protocol            C;

    on ha1 {
        device  /dev/drbd4;
        disk    /dev/XenHosting/nfs-disk;
        address 10.10.10.1:7804;
    }
    on ha2 {
        device  /dev/drbd4;
        disk    /dev/XenHosting/nfs-disk;
        address 10.10.10.2:7804;
    }

    net {
        data-integrity-alg  sha1;
        allow-two-primaries;
        after-sb-0pri       discard-zero-changes;
        after-sb-1pri       discard-secondary;
        after-sb-2pri       disconnect;
    }

    startup {
        become-primary-on both;
    }
}
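For what it's worth, the comparison between the resources was basically
a diff of the dumped configs, along these lines ("web" is just a
placeholder for one of the resources that migrates fine):

    # configuration exactly as drbdadm parses it, per resource
    drbdadm dump nfs > /tmp/drbd-nfs.conf
    drbdadm dump web > /tmp/drbd-web.conf    # "web" = one of the working resources
    diff /tmp/drbd-nfs.conf /tmp/drbd-web.conf

    # same check for the nfs resource between the two nodes
    ssh ha2 drbdadm dump nfs | diff /tmp/drbd-nfs.conf -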
Relevant snippets of the CIB:
primitive p_drbd_nfs ocf:linbit:drbd \
    params drbd_resource="nfs" \
    op monitor interval="20" role="Master" timeout="60" \
    op monitor interval="30" role="Slave" timeout="60"

ms ms_drbd_nfs p_drbd_nfs \
    meta master-max="2" notify="true" target-role="Started"

primitive nfs ocf:heartbeat:Xen \
    params xmfile="/cluster/xen/nfs" \
    meta allow-migrate="true" target-role="Started" \
    op monitor interval="10" \
    op start interval="0" timeout="45" \
    op stop interval="0" timeout="300" \
    op migrate_from interval="0" timeout="240" \
    op migrate_to interval="0" timeout="240"

order o_nfs inf: ms_drbd_nfs:promote nfs:start
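Just to confirm the dual-primary part: after a clean start I check that
both nodes are really promoted, more or less like this:

    crm_mon -1r | grep -A1 ms_drbd_nfs    # should show "Masters: [ ha1 ha2 ]"
    drbdadm role nfs                      # should report Primary/Primary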
If I start the resources "clean", meaning they're not yet running,
everything is fine. But if I then stop the domU or do a live migration,
I get "failed actions" like:
Failed actions:
p_drbd_nfs:1_monitor_20000 (node=ha2, call=837, rc=0, status=complete): ok
(This changes from failed to ok within about two seconds, and the live
migration is successful.)
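For completeness, the migration test itself and the cleanup afterwards
are nothing special, roughly:

    # live-migrate the Xen domU to the other node via the cluster
    crm resource migrate nfs ha2
    crm_mon -1rf                     # -f also lists fail counts / failed actions

    # remove the location constraint left behind by the migrate command
    # and clear the failed action on the DRBD clone
    crm resource unmigrate nfs
    crm resource cleanup p_drbd_nfs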
I played around with timeouts and such, but no luck. The logs say
"transition aborted"; could this be the problem?
Right next to that: "Sending state for detaching disk failed".
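The quoted messages come straight out of syslog; I just grepped for
them, roughly like this (standard Wheezy log locations):

    grep -i "transition aborted" /var/log/syslog
    grep -i "sending state for detaching disk failed" /var/log/syslog /var/log/kern.log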
I've put my logs on pastebin for easier reading:
[0] tail -f /var/log/syslog
[1] tail -f /var/log/dmesg
I'm quite clueless about what to do next.
Any help would be really appreciated...
Thanks in advance,
Georg
[0] http://pastebin.com/09FT14Us
[1] http://pastebin.com/5nnwZjiz