[Pacemaker] Problem with one drbd dual primary ressource
georg at riseup.net
Sat Aug 10 19:44:42 UTC 2013
Hi all,
(If this is not the correct mailing list for this question, I
apologize... maybe you could then give me a hint where it might fit
better.)
I've set up a test cluster with Debian Wheezy, pacemaker/corosync, DRBD
dual primary, and Xen on top of this. I know I should configure STONITH;
I just haven't done so yet because it's a test environment.

I've got several DRBD resources and Xen domUs. For most of them live
migration is working like a charm, but I've got problems with one DRBD
resource and the state change after a Xen domU migration (at least I
guess that's the problem). I checked my configuration for differences
between the resources, but I didn't find any, and as far as I remember,
they are all set up identically.
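In case it helps, this is roughly what I keep running on both nodes
while a migration is in progress, just to watch the state change happen
(standard DRBD tooling, nothing special):

    # kernel view of all DRBD resources, refreshed every second
    watch -n1 cat /proc/drbd

    # connection and disk state of the problematic resource
    drbdadm cstate nfs    # should stay Connected
    drbdadm dstate nfs    # should stay UpToDate/UpToDate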
The drbd.conf snippet for this resource looks like:
resource nfs {
    flexible-meta-disk  internal;
    device              /dev/drbd4;
    protocol            C;

    on ha1 {
        device  /dev/drbd4;
        disk    /dev/XenHosting/nfs-disk;
        address 10.10.10.1:7804;
    }
    on ha2 {
        device  /dev/drbd4;
        disk    /dev/XenHosting/nfs-disk;
        address 10.10.10.2:7804;
    }

    net {
        data-integrity-alg  sha1;
        allow-two-primaries;
        after-sb-0pri       discard-zero-changes;
        after-sb-1pri       discard-secondary;
        after-sb-2pri       disconnect;
    }

    startup {
        become-primary-on both;
    }
}
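For what it's worth, the comparison between the resources was basically
a diff of the dumped configs, along these lines ("web" is just a
placeholder for one of the resources that migrates fine):

    # configuration exactly as drbdadm parses it, per resource
    drbdadm dump nfs > /tmp/drbd-nfs.conf
    drbdadm dump web > /tmp/drbd-web.conf    # "web" = one of the working resources
    diff /tmp/drbd-nfs.conf /tmp/drbd-web.conf

    # same check for the nfs resource between the two nodes
    ssh ha2 drbdadm dump nfs | diff /tmp/drbd-nfs.conf -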
Relevant snippets of the CIB:
primitive p_drbd_nfs ocf:linbit:drbd \
    params drbd_resource="nfs" \
    op monitor interval="20" role="Master" timeout="60" \
    op monitor interval="30" role="Slave" timeout="60"

ms ms_drbd_nfs p_drbd_nfs \
    meta master-max="2" notify="true" target-role="Started"

primitive nfs ocf:heartbeat:Xen \
    params xmfile="/cluster/xen/nfs" \
    meta allow-migrate="true" target-role="Started" \
    op monitor interval="10" \
    op start interval="0" timeout="45" \
    op stop interval="0" timeout="300" \
    op migrate_from interval="0" timeout="240" \
    op migrate_to interval="0" timeout="240"

order o_nfs inf: ms_drbd_nfs:promote nfs:start
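Just to confirm the dual-primary part: after a clean start I check that
both nodes are really promoted, more or less like this:

    crm_mon -1r | grep -A1 ms_drbd_nfs    # should show "Masters: [ ha1 ha2 ]"
    drbdadm role nfs                      # should report Primary/Primary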
If I start the resources "clean", meaning they're not yet running,
everything is fine. But if I then stop the domU or do a live migration,
I get "failed actions" like:
Failed actions:
p_drbd_nfs:1_monitor_20000 (node=ha2, call=837, rc=0, status=complete): ok
(This changes from failed to ok within about two seconds, and the live
migration is successful.)
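For completeness, the migration test itself and the cleanup afterwards
are nothing special, roughly:

    # live-migrate the Xen domU to the other node via the cluster
    crm resource migrate nfs ha2
    crm_mon -1rf                     # -f also lists fail counts / failed actions

    # remove the location constraint left behind by the migrate command
    # and clear the failed action on the DRBD clone
    crm resource unmigrate nfs
    crm resource cleanup p_drbd_nfs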
I played around with timeouts and such, but no luck. The logs say
"transition aborted"; could this be the problem?
Right next to that: "Sending state for detaching disk failed".
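The quoted messages come straight out of syslog; I just grepped for
them, roughly like this (standard Wheezy log locations):

    grep -i "transition aborted" /var/log/syslog
    grep -i "sending state for detaching disk failed" /var/log/syslog /var/log/kern.log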
I've put my logs on pastebin for easier reading:
[0] tail -f /var/log/syslog
[1] tail -f /var/log/dmesg
I'm quite clueless about what to do next.
Any help would be really appreciated...
Thanks in advance,
Georg
[0] http://pastebin.com/09FT14Us
[1] http://pastebin.com/5nnwZjiz