[ClusterLabs] DRBD fencing issue on failover causes resource failure
Tim Walberg
twalberg at gmail.com
Wed Mar 16 17:17:31 UTC 2016
I'm having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
(drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
resources consist of a cluster address, a DRBD device mirroring between the
two cluster nodes, the file system, and the nfs-server resource. The
resources all behave properly until an extended failover or outage.
I have tested failover in several ways ("pcs cluster standby", "pcs cluster
stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.) and the
symptoms are the same each time: failover never completes until the killed
node is brought back into the cluster. On the remaining node, the DRBD
device appears to be in a "Secondary/Unknown" state, and the resources end
up looking like:
# pcs status
Cluster name: nfscluster
Last updated: Wed Mar 16 12:05:33 2016    Last change: Wed Mar 16 12:04:46 2016 by root via cibadmin on nfsnode01
Stack: corosync
Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition with quorum
2 nodes and 5 resources configured

Online: [ nfsnode01 ]
OFFLINE: [ nfsnode02 ]

Full list of resources:

 nfsVIP (ocf::heartbeat:IPaddr2): Started nfsnode01
 nfs-server (systemd:nfs-server): Stopped
 Master/Slave Set: drbd_master [drbd_dev]
     Slaves: [ nfsnode01 ]
     Stopped: [ nfsnode02 ]
 drbd_fs (ocf::heartbeat:Filesystem): Stopped

PCSD Status:
  nfsnode01: Online
  nfsnode02: Online

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
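
For anyone reproducing this, the "Secondary/Unknown" state mentioned above
can be read directly on the surviving node. This is only a minimal sketch,
assuming the resource is named drbd0 as in my config; describe_role is a
hypothetical helper of my own, not part of DRBD, while "drbdadm role",
"drbdadm cstate" and "drbdadm dstate" are the usual DRBD 8.4 subcommands:

```shell
#!/bin/sh
# Split a DRBD 8.4 role pair such as "Secondary/Unknown" into its
# local and peer halves (hypothetical helper, for readability only).
describe_role() {
    printf 'local=%s peer=%s\n' "${1%%/*}" "${1#*/}"
}

# Only query DRBD when the tools are actually installed on this host.
if command -v drbdadm >/dev/null 2>&1; then
    describe_role "$(drbdadm role drbd0)"   # role pair, local/peer
    drbdadm cstate drbd0                    # connection state
    drbdadm dstate drbd0                    # disk state pair, local/peer
fi
```

The same information is also visible in /proc/drbd on DRBD 8.4.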
As soon as I bring the second node back online, the failover completes. But
this is not a workable state, since an extended outage of one node for any
reason essentially kills the cluster services. I've presumably missed
something in configuring the resources, but I haven't been able to pinpoint
it yet.
Perusing the logs, it appears that, upon the initial failure, DRBD does in
fact promote the drbd_master resource. Immediately after that, however,
pengine calls for it to be demoted. I haven't been able to determine why
yet, but it seems to be tied to the fencing configuration. I can see that
the crm-fence-peer.sh script is called, but it almost seems like it's
fencing the wrong node... Indeed, I do see that it adds a -INFINITY
location constraint for the surviving node, which would explain the
decision to demote the DRBD master.
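
The constraint added by the fence handler can be picked out of the
constraint listing. A rough sketch, under the assumption that
crm-fence-peer.sh uses ids beginning with "drbd-fence-by-handler" (that is
what I see here, but I have not checked other versions); fence_constraints
is a hypothetical helper name:

```shell
#!/bin/sh
# Keep only the lines of a constraint listing that mention a constraint
# placed by the DRBD fence handler.
# Assumption: the handler's constraint ids start with "drbd-fence-by-handler".
fence_constraints() {
    grep 'drbd-fence-by-handler' || true   # no match is not an error here
}

# Only query the cluster when pcs is actually installed on this host.
if command -v pcs >/dev/null 2>&1; then
    pcs constraint --full | fence_constraints
fi
```

Once the id is known, the full rule (score, role and node expression) can
be read from "pcs constraint --full", which should show which node the
constraint actually targets.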
My DRBD resource looks like this:
# cat /etc/drbd.d/drbd0.res
resource drbd0 {
        protocol C;
        startup { wfc-timeout 0; degr-wfc-timeout 120; }

        disk {
                on-io-error detach;
                fencing resource-only;
        }

        handlers {
                fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }

        on nfsnode01 {
                device /dev/drbd0;
                disk /dev/vg_nfs/lv_drbd0;
                meta-disk internal;
                address 10.0.0.2:7788;
        }

        on nfsnode02 {
                device /dev/drbd0;
                disk /dev/vg_nfs/lv_drbd0;
                meta-disk internal;
                address 10.0.0.3:7788;
        }
}
If I comment out the three lines having to do with fencing, the failover
works properly. But I'd prefer to keep the fencing in place on the off
chance that we end up with a split brain instead of just a node outage...
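
To be explicit, the three lines are the "fencing" policy in the disk
section and the two handlers. The test change can be sketched as a filter,
purely for illustration; comment_out_fencing is a hypothetical helper and
the output path under /tmp is made up:

```shell
#!/bin/sh
# Comment out the fencing policy and both fence handlers in a DRBD
# resource file read from stdin; all other lines pass through unchanged.
comment_out_fencing() {
    sed -E 's/^([[:space:]]*)(fencing|fence-peer|after-resync-target)([[:space:]])/\1# \2\3/'
}

# Write a modified copy next to nothing important and show the difference.
res=/etc/drbd.d/drbd0.res
if [ -r "$res" ]; then
    comment_out_fencing < "$res" > /tmp/drbd0.res.nofence   # hypothetical path
    diff "$res" /tmp/drbd0.res.nofence || true              # diff exits 1 on changes
fi
```

The real file would need the same change on both nodes, followed by
something like "drbdadm adjust drbd0" to apply it.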
And, here's "pcs config --full":
# pcs config --full
Cluster Name: nfscluster
Corosync Nodes:
 nfsnode01 nfsnode02
Pacemaker Nodes:
 nfsnode01 nfsnode02

Resources:
 Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.0.0.1 cidr_netmask=24
  Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
              stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
              monitor interval=15s (nfsVIP-monitor-interval-15s)
 Resource: nfs-server (class=systemd type=nfs-server)
  Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
 Master: drbd_master
  Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: drbd_dev (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=drbd0
   Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
               promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
               demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
               stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
               monitor interval=29s role=Master (drbd_dev-monitor-interval-29s)
               monitor interval=31s role=Slave (drbd_dev-monitor-interval-31s)
 Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
  Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
  Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
              stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
              monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start nfsVIP then start nfs-server (kind:Mandatory) (id:order-nfsVIP-nfs-server-mandatory)
  start drbd_fs then start nfs-server (kind:Mandatory) (id:order-drbd_fs-nfs-server-mandatory)
  promote drbd_master then start drbd_fs (kind:Mandatory) (id:order-drbd_master-drbd_fs-mandatory)
Colocation Constraints:
  nfs-server with nfsVIP (score:INFINITY) (id:colocation-nfs-server-nfsVIP-INFINITY)
  nfs-server with drbd_fs (score:INFINITY) (id:colocation-nfs-server-drbd_fs-INFINITY)
  drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master) (id:colocation-drbd_fs-drbd_master-INFINITY)

Resources Defaults:
 resource-stickiness: 100
 failure-timeout: 60
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: nfscluster
 dc-version: 1.1.13-10.el7_2.2-44eb2dd
 have-watchdog: false
 maintenance-mode: false
 stonith-enabled: false