[ClusterLabs] DRBD fencing issue on failover causes resource failure
Digimer
lists at alteeve.ca
Wed Mar 16 17:34:55 UTC 2016
On 16/03/16 01:17 PM, Tim Walberg wrote:
> Having an issue on a newly built CentOS 7.2.1511 NFS cluster with DRBD
> (drbd84-utils-8.9.5-1 with kmod-drbd84-8.4.7-1_1). At this point, the
> resources consist of a cluster address, a DRBD device mirroring between
> the two cluster nodes, the file system, and the nfs-server resource. The
> resources all behave properly until an extended failover or outage.
>
> I have tested failover in several ways ("pcs cluster standby", "pcs
> cluster stop", "init 0", "init 6", "echo b > /proc/sysrq-trigger", etc.)
> and the symptoms are that, until the killed node is brought back into
> the cluster, failover never seems to complete. The DRBD device appears
> on the remaining node to be in a "Secondary/Unknown" state, and the
> resources end up looking like:
>
> # pcs status
> Cluster name: nfscluster
> Last updated: Wed Mar 16 12:05:33 2016 Last change: Wed Mar 16
> 12:04:46 2016 by root via cibadmin on nfsnode01
> Stack: corosync
> Current DC: nfsnode01 (version 1.1.13-10.el7_2.2-44eb2dd) - partition
> with quorum
> 2 nodes and 5 resources configured
>
> Online: [ nfsnode01 ]
> OFFLINE: [ nfsnode02 ]
>
> Full list of resources:
>
> nfsVIP (ocf::heartbeat:IPaddr2): Started nfsnode01
> nfs-server (systemd:nfs-server): Stopped
> Master/Slave Set: drbd_master [drbd_dev]
> Slaves: [ nfsnode01 ]
> Stopped: [ nfsnode02 ]
> drbd_fs (ocf::heartbeat:Filesystem): Stopped
>
> PCSD Status:
> nfsnode01: Online
> nfsnode02: Online
>
> Daemon Status:
> corosync: active/enabled
> pacemaker: active/enabled
> pcsd: active/enabled
>
> As soon as I bring the second node back online, the failover completes.
> But this is obviously not a good state, as an extended outage for any
> reason on one node essentially kills the cluster services. There's
> obviously something I've missed in configuring the resources, but I
> haven't been able to pinpoint it yet.
>
> Perusing the logs, it appears that, upon the initial failure, DRBD does
> in fact promote the drbd_master resource, but immediately after that,
> pengine calls for it to be demoted, for reasons I haven't been able to
> determine yet but which seem to be tied to the fencing configuration. I can
> see that the crm-fence-peer.sh script is called, but it almost seems
> like it's fencing the wrong node... Indeed, I do see that it adds a
> -INFINITY location constraint for the surviving node, which would
> explain the decision to demote the DRBD master.
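You can confirm that directly. crm-fence-peer.sh normally adds a location
constraint with an id of the form drbd-fence-by-handler-<resource>-<ms-name>
(so likely drbd-fence-by-handler-drbd0-drbd_master in your setup — check the
actual id, this is an assumption):

```shell
# List all constraints with their ids and look for the fencing constraint
pcs constraint --full | grep drbd-fence
# If it has pinned the wrong node, it can be removed manually by id, e.g.:
# pcs constraint remove drbd-fence-by-handler-drbd0-drbd_master
```

Normally crm-unfence-peer.sh removes it again after resync, which is why
everything clears up once the second node returns.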
>
> My DRBD resource looks like this:
>
> # cat /etc/drbd.d/drbd0.res
> resource drbd0 {
>
> protocol C;
> startup { wfc-timeout 0; degr-wfc-timeout 120; }
>
> disk {
> on-io-error detach;
> fencing resource-only;
This should be 'fencing resource-and-stonith;', but on its own that won't
do anything until pacemaker's stonith is working.
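For reference, the disk stanza would then read (same handlers you already
have; only effective once pacemaker's stonith is configured and tested):

```
disk {
    on-io-error detach;
    fencing resource-and-stonith;
}
```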
> }
>
> handlers {
> fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> }
>
> on nfsnode01 {
> device /dev/drbd0;
> disk /dev/vg_nfs/lv_drbd0;
> meta-disk internal;
> address 10.0.0.2:7788;
> }
>
> on nfsnode02 {
> device /dev/drbd0;
> disk /dev/vg_nfs/lv_drbd0;
> meta-disk internal;
> address 10.0.0.3:7788;
> }
> }
>
> If I comment out the three lines having to do with fencing, the failover
> works properly. But I'd prefer to have the fencing there in the odd
> chance that we end up with a split brain instead of just a node outage...
>
> And, here's "pcs config --full":
>
> # pcs config --full
> Cluster Name: nfscluster
> Corosync Nodes:
> nfsnode01 nfsnode02
> Pacemaker Nodes:
> nfsnode01 nfsnode02
>
> Resources:
> Resource: nfsVIP (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.0.0.1 cidr_netmask=24
> Operations: start interval=0s timeout=20s (nfsVIP-start-interval-0s)
> stop interval=0s timeout=20s (nfsVIP-stop-interval-0s)
> monitor interval=15s (nfsVIP-monitor-interval-15s)
> Resource: nfs-server (class=systemd type=nfs-server)
> Operations: monitor interval=60s (nfs-server-monitor-interval-60s)
> Master: drbd_master
> Meta Attrs: master-max=1 master-node-max=1 clone-max=2
> clone-node-max=1 notify=true
> Resource: drbd_dev (class=ocf provider=linbit type=drbd)
> Attributes: drbd_resource=drbd0
> Operations: start interval=0s timeout=240 (drbd_dev-start-interval-0s)
> promote interval=0s timeout=90 (drbd_dev-promote-interval-0s)
> demote interval=0s timeout=90 (drbd_dev-demote-interval-0s)
> stop interval=0s timeout=100 (drbd_dev-stop-interval-0s)
> monitor interval=29s role=Master
> (drbd_dev-monitor-interval-29s)
> monitor interval=31s role=Slave
> (drbd_dev-monitor-interval-31s)
> Resource: drbd_fs (class=ocf provider=heartbeat type=Filesystem)
> Attributes: device=/dev/drbd0 directory=/exports/drbd0 fstype=xfs
> Operations: start interval=0s timeout=60 (drbd_fs-start-interval-0s)
> stop interval=0s timeout=60 (drbd_fs-stop-interval-0s)
> monitor interval=20 timeout=40 (drbd_fs-monitor-interval-20)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> start nfsVIP then start nfs-server (kind:Mandatory)
> (id:order-nfsVIP-nfs-server-mandatory)
> start drbd_fs then start nfs-server (kind:Mandatory)
> (id:order-drbd_fs-nfs-server-mandatory)
> promote drbd_master then start drbd_fs (kind:Mandatory)
> (id:order-drbd_master-drbd_fs-mandatory)
> Colocation Constraints:
> nfs-server with nfsVIP (score:INFINITY)
> (id:colocation-nfs-server-nfsVIP-INFINITY)
> nfs-server with drbd_fs (score:INFINITY)
> (id:colocation-nfs-server-drbd_fs-INFINITY)
> drbd_fs with drbd_master (score:INFINITY) (with-rsc-role:Master)
> (id:colocation-drbd_fs-drbd_master-INFINITY)
>
> Resources Defaults:
> resource-stickiness: 100
> failure-timeout: 60
> Operations Defaults:
> No defaults set
>
> Cluster Properties:
> cluster-infrastructure: corosync
> cluster-name: nfscluster
> dc-version: 1.1.13-10.el7_2.2-44eb2dd
> have-watchdog: false
> maintenance-mode: false
> stonith-enabled: false
Configure *and test* stonith in pacemaker first; then DRBD will hook
into it and use it properly. DRBD simply asks pacemaker to do the fence,
but you currently don't have it set up (note 'stonith-enabled: false'
above, and no stonith devices defined).
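As a rough sketch only — the fence agent, BMC addresses, and credentials
below are placeholders, not from your setup — IPMI-based fencing for a
two-node cluster might look like:

```shell
# Placeholder BMC IPs and credentials; substitute your hardware's details.
pcs stonith create fence_nfsnode01 fence_ipmilan \
    pcmk_host_list=nfsnode01 ipaddr=10.0.1.1 login=admin passwd=secret \
    lanplus=1 op monitor interval=60s
pcs stonith create fence_nfsnode02 fence_ipmilan \
    pcmk_host_list=nfsnode02 ipaddr=10.0.1.2 login=admin passwd=secret \
    lanplus=1 op monitor interval=60s

# Keep each fence device off the node it is meant to fence
pcs constraint location fence_nfsnode01 avoids nfsnode01
pcs constraint location fence_nfsnode02 avoids nfsnode02

pcs property set stonith-enabled=true

# Then actually test it, in both directions:
pcs stonith fence nfsnode02
```

Only after a manual fence works from each node should you re-enable the
DRBD fence-peer handlers and rely on them during failover.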
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?