[Pacemaker] Seeking suggestions for cluster configuration of HA iSCSI target and initiators
Phil Frost
phil at macprofessionals.com
Mon Jul 16 16:08:50 UTC 2012
I'm designing a cluster to run both iSCSI targets and initiators to
ultimately provide block devices to virtual machines. I'm considering
the case of a target failure, and how to handle that as gracefully as
possible. Ideally, I/O would pause until the target recovers, and the VMs
would neither restart nor see I/O errors.
I've observed that the iscsi RA will configure the initiator to retry
connections indefinitely if the target should fail. This is mostly good,
except that if the initiator is in the retrying state, the monitor
action will return an error.
The Right Thing to do in this case, I would think, would be to just
wait. Of course the initiators can't work if the target is down, but the
initiators will recover automatically when the target recovers. Ideally
the cluster would wait for the target (which it also manages) to
recover, then try again to monitor the initiators. For good measure, it
might try monitoring the initiators a couple times, since it can take
them a moment to reconnect.
Unfortunately, what actually happens is the monitor action on the
initiator fails. Pacemaker then attempts to stop the initiator, and that
also fails, because the target is still unavailable. Then the initiator
node gets STONITHed, taking out all the hosted VMs with it.
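To make that escalation concrete, here is a rough model of it in Python.
It is only an illustration, not Pacemaker code: the function name and its
arguments are made up, and it assumes the usual default failure handling
(a failed monitor leads to a stop, and a failed stop leads to fencing
when STONITH is enabled, or to the resource being blocked when it is
not). It ignores constraints entirely.

```python
def recovery_actions(monitor_ok, stop_ok, stonith_enabled=True):
    """Return the ordered recovery steps for one monitor result.

    Hypothetical model of the default escalation path; it does not
    consider ordering or colocation constraints.
    """
    if monitor_ok:
        # Healthy resource: nothing to recover.
        return []
    # A failed monitor is answered with a stop of the resource.
    actions = ["stop"]
    if stop_ok:
        # Clean stop: the resource can be started again.
        actions.append("start")
    elif stonith_enabled:
        # A failed stop cannot be recovered safely, so the node is fenced.
        actions.append("fence")
    else:
        # Without fencing, the resource is left blocked.
        actions.append("block")
    return actions


if __name__ == "__main__":
    # The case described above: monitor fails, then stop fails as well.
    print(recovery_actions(monitor_ok=False, stop_ok=False))
```

In the failed-target case both the monitor and the stop fail, so the
model ends in "fence", which is what takes the VMs down.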
I added a mandatory, asymmetrical order constraint of target ->
initiator, so at least Pacemaker will not attempt to restart the
initiator after a target failure. I made it asymmetrical so that restarts
of the target do not force restarts of the initiator. However, it
doesn't do much to help the failed-target case.
What's a good solution? Is there some way to suspend monitoring of the
initiators while Pacemaker knows the target has failed? I suppose I could
modify the iscsi RA to return success for monitor when the initiator is
attempting to reconnect to the target, but then what if the initiator
itself has failed and the target is operational? And what about race
conditions in cases where the target has failed, but Pacemaker has not
yet detected the failure through a monitor operation?
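As a sketch of the RA modification I have in mind, the monitor decision
could look roughly like the Python below. This is not the iscsi RA's
actual logic: the session state strings, the function name, and the
grace period are all hypothetical. Only the numeric return codes are the
standard OCF ones. Note that it does not tell a failed initiator apart
from a failed target; it only bounds how long a reconnecting session is
reported as healthy.

```python
# Standard OCF return codes.
OCF_SUCCESS = 0
OCF_ERR_GENERIC = 1
OCF_NOT_RUNNING = 7


def monitor_result(session_state, retrying_since, now, grace_seconds=60):
    """Decide what a tolerant monitor action would return.

    session_state  -- hypothetical: "logged_in", "retrying" or "absent"
    retrying_since -- time (seconds) the session entered "retrying",
                      or None if it is not retrying
    now            -- current time in seconds
    grace_seconds  -- assumed limit on how long retrying is tolerated
    """
    if session_state == "logged_in":
        return OCF_SUCCESS
    if session_state == "retrying" and retrying_since is not None:
        # Report success while the initiator has been reconnecting for
        # no longer than the grace period.
        if now - retrying_since <= grace_seconds:
            return OCF_SUCCESS
        return OCF_ERR_GENERIC
    if session_state == "absent":
        # No session exists at all: the resource is cleanly stopped.
        return OCF_NOT_RUNNING
    return OCF_ERR_GENERIC


if __name__ == "__main__":
    # Retrying for 10 seconds: still within the assumed grace period.
    print(monitor_result("retrying", retrying_since=100, now=110))
```

The grace period papers over the race where the target has failed but
its own monitor has not noticed yet. It does not help if the target
stays down longer than the grace period.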