[Pacemaker] Pacemaker on system with disk failure
Carsten Otto
carsten.otto at andrena.de
Tue Sep 23 13:39:45 UTC 2014
Hello,
I run Corosync + Pacemaker + DRBD in a two-node cluster, where all
resources are part of a group that is colocated with DRBD (DRBD + virtual
IP + filesystem + ...). To test my configuration, I currently have two
nodes with only a single disk drive each. This drive is the only LVM
physical volume in an LVM volume group; the Linux system resides on some
of its logical volumes, and the disk exported by DRBD is another logical
volume.
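For reference, the Pacemaker side of this corresponds roughly to the
following pcs sketch (resource names, IP address and mount point are
simplified placeholders, not a verbatim dump of my configuration):

  # DRBD resource plus master/slave wrapper
  pcs resource create drbd_disk0 ocf:linbit:drbd drbd_resource=disk0 \
      op monitor interval=29s role=Master op monitor interval=31s role=Slave
  pcs resource master ms_drbd_disk0 drbd_disk0 \
      master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true

  # the services that should follow the DRBD master
  pcs resource create fs_disk0 ocf:heartbeat:Filesystem \
      device=/dev/drbd0 directory=/srv/disk0 fstype=ext4
  pcs resource create virtual_ip ocf:heartbeat:IPaddr2 ip=192.0.2.10 cidr_netmask=24
  pcs resource group add grp_services fs_disk0 virtual_ip

  # the group runs where DRBD is master, and only after promotion
  pcs constraint colocation add grp_services with master ms_drbd_disk0 INFINITY
  pcs constraint order promote ms_drbd_disk0 then start grp_services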
When I unplug the power of the disk drive on the node running the
resources (where DRBD is primary), DRBD notices this ("diskless").
Furthermore, my services stop working (which is understandable without a
working disk drive).
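Side note: instead of physically pulling the power, a comparable failure
can presumably also be provoked from software, e.g. (assuming the single
physical drive is /dev/sda):

  echo offline > /sys/block/sda/device/state   # stop I/O to the device
  # or remove the disk from the SCSI layer entirely:
  echo 1 > /sys/block/sda/device/delete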
In my experiments, however, one of the following two problems occurs:
1) The services are stopped and DRBD is demoted (according to pcs status
and pacemaker.log). However, according to /proc/drbd on the surviving
node, the diskless node is still running as primary. As a consequence,
I see failing attempts to promote on the surviving node:
drbd(DRBD)[1797]: 2014/09/23_14:35:56 ERROR: disk0: Called drbdadm -c /etc/drbd.conf primary disk0
drbd(DRBD)[1797]: 2014/09/23_14:35:56 ERROR: disk0: Exit code 11
The problem here seems to be:
crmd: info: match_graph_event: Action DRBD_demote_0 (12) confirmed on diskless_node (rc=0)
This demote operation obviously should not have been confirmed. I also
strongly doubt that the stop operations of the ordinary resources can
really succeed without access to the resource agent scripts (which live
on the failed disk) and the tools they call.
2) My services do not work anymore, but nothing happens in the cluster.
Everything looks like it did before the failure, the only difference
being that /proc/drbd shows "Diskless" and some "oos". It seems
Corosync/Pacemaker reports "all is well" to the DC, while internally
(due to the missing disk) nothing works. I guess that running the
various monitor operations is problematic without access to the actual
files, so I would like to see some sort of failure communicated from
the diskless node to the surviving node (or have the surviving node
reach the same conclusion via some timeout).
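One workaround I am considering (an untested sketch based on the DRBD 8.4
documentation; the concrete handler command is just an example) would be
to let DRBD itself escalate a local I/O error so that the peer takes over:

  resource disk0 {
    disk {
      # run the local-io-error handler instead of silently detaching
      on-io-error call-local-io-error;
    }
    handlers {
      # e.g. power the node off so the surviving node notices immediately
      local-io-error "echo o > /proc/sysrq-trigger";
    }
  }

But I would prefer the cluster itself to detect and handle this situation.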
Is this buggy behaviour? How should a node behave if all of its disks
stop working?
I can reproduce this. If you need details about the configuration or more
output from pacemaker.log, please let me know.
The versions reported by CentOS 7:
corosync 2.3.3-2.el7
pacemaker 1.1.10-32.el7_0
drbd 8.4.5-1.el7.elrepo
Thank you,
Carsten