[Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators

Mon Jul 16 13:15:32 EDT 2012

On 07/16/2012 01:14 PM, Digimer wrote:
> On 07/16/2012 12:08 PM, Phil Frost wrote:
>> I'm designing a cluster to run both iSCSI targets and initiators to
>> ultimately provide block devices to virtual machines. I'm considering
>> the case of a target failure, and how to handle that as gracefully as
>> possible. Ideally, IO may be paused until the target recovers, but VMs
>> do not restart or see IO errors.
>>
>> I've observed that the iscsi RA will configure the initiator to retry
>> connections indefinitely if the target should fail. This is mostly good,
>> except that if the initiator is in the retrying state, the monitor
>> action will return an error.
>>
>> The Right Thing to do in this case, I would think, would be to just
>> wait. Of course the initiators can't work if the target is down, but the
>> initiators will recover automatically when the target recovers. Ideally
>> the cluster would wait for the target (which it also manages) to
>> recover, then try again to monitor the initiators. For good measure, it
>> might try monitoring the initiators a couple times, since it can take
>> them a moment to reconnect.
>>
>> Unfortunately, what actually happens is the monitor action on the
>> initiator fails. Pacemaker then attempts to stop the initiator, and that
>> also fails, because the target is still unavailable. Then the initiator
>> node gets STONITHed, taking out all the hosted VMs with it.
>>
>> I added a mandatory, non-symmetrical order constraint of target ->
>> initiator, so at least Pacemaker will not attempt to re-start the
>> initiator after a target failure. I made it asymmetrical so that restarts
>> of the target do not force restarts of the initiator. However, it
>> doesn't do much to help the failed-target case.
>>
>> What's a good solution? Is there some way to suspend monitoring of the
>> initiators if Pacemaker knows the target is failed? I suppose I could
>> modify the iscsi RA to return success for monitor in the case that the
>> initiator is attempting to reconnect to the target, but then what if
>> actually the initiator has failed, and the target is operational? What
>> then about race conditions that might exist in cases where the target
>> has failed, but Pacemaker has not yet detected the target failure through
>> a monitor operation?
> 
> I've only tested this a little, so please take it as a general
> suggestion rather than strong advice.
> 
> I created a two-node cluster, using Red Hat's High Availability Add-On,
> using DRBD to keep the data replicated between the two "SAN" nodes and
> tgtd to export the LUNs. I had a virtual IP on the cluster to act as the
> target IP and I had DRBD in dual-primary mode with clustered LVM (so I
> had DRBD as the PV and exported the space from the LVs).
> 
> Then I built a second cluster of five nodes to host KVM VMs. The
> underlying nodes used clustered LVM as well, but this time the LUN was
> the PV. I carved this up into an LV per VM and made the VMs the HA
> service. Again using RH HA-Addon.
> 
> In this setup, I was able to fail over the SAN without losing any VMs. I
> even messed up the fencing on the SAN cluster once, which meant it took
> over 30s to fail over, and I didn't lose the VMs. So to the minimal extent I
> tested it, it worked excellently.
> 
> I have some very rough notes on this setup. They're not fit for public
> consumption at all, but if you'd like I'll send them to you directly.
> They include the configurations which might help as a template or similar.
> 
> Digimer

Oh whoops, I just realized this was the Pacemaker list, not the general
Linux Clustering list. Heh. Doing all of the management using Pacemaker
instead of RH's HA-Addon should be just fine, too.
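
For what it's worth, the idea Phil floated above (a monitor that reports
success while the initiator is still retrying) could be sketched roughly
as below. This is only an illustration of the decision logic, not the
actual iscsi RA and not something tested in this thread. The function
name, the state labels "LOGGED_IN" and "RECONNECTING", and the
grace_seconds parameter are all made up for the example; only the numeric
OCF exit codes are the standard ones.

```python
# Standard OCF resource agent exit codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_initiator(session_state, reconnecting_since, now, grace_seconds=120):
    """Decide what a monitor action might return for an initiator.

    session_state      -- hypothetical label for the session state
    reconnecting_since -- timestamp when retrying began, or None
    now                -- current timestamp, in the same units
    grace_seconds      -- hypothetical limit on how long retrying is tolerated
    """
    if session_state == "LOGGED_IN":
        return OCF_SUCCESS

    if session_state == "RECONNECTING":
        # Tolerate a retrying session for a bounded time, so a short
        # target outage does not escalate to stop and then fencing.
        if reconnecting_since is not None and now - reconnecting_since <= grace_seconds:
            return OCF_SUCCESS
        # Retrying for too long (or for an unknown duration): report failure.
        return OCF_ERR_GENERIC

    # No session at all: the resource is cleanly not running.
    return OCF_NOT_RUNNING
```

Note that the grace period only bounds how long a failure gets masked. It
does not tell a failed target apart from a failed initiator, so the
questions Phil raised about that case and about the detection race still
stand.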

-- 
Digimer
Papers and Projects: https://alteeve.com