[Pacemaker] Two-node cluster fencing
Tim Bordeman
timbo_nospam at web.de
Tue Oct 22 13:52:05 UTC 2013
On 2013-10-21 10:40, Michael Schwartzkopff wrote:
> On Monday, 21 October 2013, 10:28:53, Timm Bordeman wrote:
>> Hi,
>>
>> I'm building a two-node cluster based on XenServer, Pacemaker and
>> DRBD. All needed resources are configured and correctly handled by
>> Pacemaker, but currently I'm struggling with stonith / fencing.
>>
>> Both physical servers run XenServer and a couple of virtual machines
>> which are mirrored. For example, on each server an Apache VM is
>> running; the two VMs share a data partition over DRBD. I configured
>> fencing via Xen, which reliably restarts any faulty VM as long as
>> both physical servers are working correctly.
>>
>> Unfortunately, fencing doesn't work when the server that hosts a
>> faulty virtual machine is powered off or unreachable over the
>> network. In this case Pacemaker does not promote the DRBD partition
>> on the second / passive virtual machine to primary, and other
>> resources, like the Apache server, won't be started. I know that
>> this is expected behaviour of Pacemaker and DRBD, but I'm not sure
>> what is needed to make the failover reliable even when a physical
>> server is completely broken. Fencing by rebooting the broken server
>> is obviously not an option, since the server wouldn't come up due to
>> a hardware defect.
>>
>> I appreciate any help on this.
>>
>> Thanks,
>> Tim
Hi,
I'm sorry for the delay.
> Have you considered that quorum does not work in a two-node cluster
> (option no-quorum-policy="ignore")?
Yes, I did (see [1]).
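For reference, that is set as a cluster property via the crm shell,
roughly like this (the property names are standard Pacemaker options;
the exact values are from my setup as dumped in [1]):

```shell
# Two-node quorum workaround: without this, losing one node would make
# the surviving node stop all resources for lack of quorum.
crm configure property no-quorum-policy=ignore

# Fencing stays enabled; the cluster still tries to stonith the peer.
crm configure property stonith-enabled=true
```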
> The other possibility is that fencing does not reach the other
> server to run its commands successfully.
Exactly. The agent fence_xenapi tries to fence the virtual machine, but
cannot connect to the physical host. It fails with a "no route to
host" error (see [6]).
> Please check the logs and give more detail on your setup. What do
> you want to achieve? Config? Logs?
Well, if a virtual machine cannot be fenced, I want the passive node
(after some retries or a delay) to become the primary one. I know that
this might lead to a split-brain under some circumstances, but I'm
confident that such situations are rare and would have to be handled
manually by an administrator.
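What I have in mind is something like a fencing fallback. A sketch in
crm shell syntax (the stonith resource IDs st-xenapi and st-manual are
placeholders, not my actual configuration; stonith:meatware is the
cluster-glue agent that waits for an administrator to confirm the node
is really down):

```shell
# Level 1: try to fence the VM through the Xen API (fence_xenapi).
# Level 2: if the physical host is unreachable, fall back to manual
# confirmation by an administrator (stonith:meatware), after which
# Pacemaker treats the node as fenced and promotes the passive DRBD.
crm configure fencing_topology \
    node1: st-xenapi st-manual \
    node2: st-xenapi st-manual
```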
Tim
[1] cib dump: http://pastebin.com/QWEJjJSZ
[2] corosync.conf: http://pastebin.com/zaQjDgPA
[3] r0.conf: http://pastebin.com/M6FnAfHu
[4] r1.conf: http://pastebin.com/SHd2Jdq7
[5] corosync.log (excerpt): http://pastebin.com/QHkUeNh1
[6] syslog (excerpt): http://pastebin.com/Zfd56mCE