[Pacemaker] LIO + Pacemaker kernel oops on failover
Phil Frost
phil at macprofessionals.com
Fri Jul 13 15:41:23 UTC 2012
On 07/03/2012 02:38 PM, Phil Frost wrote:
> It seems there's something about the iSCSI RAs that hit a bug in LIO:
>
> http://comments.gmane.org/gmane.linux.scsi.target.devel/1568?set_cite=hide
>
>
> I seem to be hitting the same problem quite reliably whenever I
> migrate the iSCSI targets in my cluster. Sounds like the OP was able
> to reach a suitable workaround, but I'm not very experienced with LIO
> or iSCSI so the discussion is a bit over my head. Anyone have some
> idea how to implement the changes described there?
I wasn't able to find a way to modify the existing
iSCSI(Target|LogicalUnit) RAs to stop the target in a way that avoided
this bug in LIO. The problem was largely that with targets and logical
units as separate resources, it was difficult to start the target before
the LUs, and also stop the target before the LUs. I tried using
asymmetric order constraints, but it didn't work so well in testing. I
don't know if it's because the shutdown wasn't working cleanly, or if
the iSCSILogicalUnit resources were upset that the LUs were stopped when
Pacemaker wasn't expecting it.
Anyhow, my solution was to write a new RA (attached) which managed the
target and the LUs together, and thus could control the ordering of
starting and stopping them in detail. It's not as featureful or general
as the existing RAs, but in my testing so far it is stable.
This is the first RA I have written, so I would appreciate any comments.
One problem in particular relates to the monitor action -- you can see
it only checks that the target is running. I could add monitoring for
the LUs easily enough, but I'm not clear on what should happen if the
target is up, but the LUs are not. In this state the service is neither
"up" nor "down", it's broken, and the right thing to do is probably
attempt to restart it. I'm not sure how I communicate that to Pacemaker
from my RA, though. Should I return OCF_ERR_GENERIC? What will Pacemaker
do in this case?
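A minimal sketch of what an LU-aware monitor could look like follows. It is not part of the attached RA. It assumes that each mapped LU shows up as a directory named lun_N under tpgt_1/lun in configfs, and it uses a hypothetical CONFIGFS_ROOT variable and function name so the check can be tried against a scratch directory. The return codes are hard-coded here only to keep the sketch self-contained; a real RA gets them from .ocf-shellfuncs. Returning OCF_ERR_GENERIC from monitor marks the resource as failed, and Pacemaker's usual recovery for that is to stop and start it again, subject to the on-fail and migration-threshold settings.

```shell
#!/bin/bash
# Sketch only: OCF return codes normally come from .ocf-shellfuncs.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical override so the check can run against a test directory.
: "${CONFIGFS_ROOT:=/sys/kernel/config/target/iscsi}"

# Usage: liotarget_monitor_luns IQN "LUN:path [LUN:path ...]"
liotarget_monitor_luns() {
    local iqn="$1" luns="$2"
    local tpg="${CONFIGFS_ROOT}/${iqn}/tpgt_1"
    local lun_and_path lun enabled

    # No configfs entry for the target: cleanly stopped.
    [ -d "${CONFIGFS_ROOT}/${iqn}" ] || return $OCF_NOT_RUNNING

    # Target exists but its TPG is not enabled: also treated as stopped.
    enabled=$(cat "${tpg}/enable" 2>/dev/null)
    [ "${enabled}" = "1" ] || return $OCF_NOT_RUNNING

    # Target is up; every configured LU must be mapped (assumed layout:
    # tpgt_1/lun/lun_N). A missing LU means "broken", not "stopped".
    for lun_and_path in ${luns}; do
        lun=${lun_and_path%%:*}
        if [ ! -d "${tpg}/lun/lun_${lun}" ]; then
            echo "LUN ${lun} is missing from target ${iqn}" >&2
            return $OCF_ERR_GENERIC
        fi
    done
    return $OCF_SUCCESS
}
```

With this shape, a target whose TPG is enabled but which has lost an LU reports a failure rather than "not running", so the cluster would attempt a restart instead of assuming a clean stop.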
-------------- next part --------------
#!/bin/bash
: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs
VERSION="0.1"
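OCF_RESKEY_portals_default="0.0.0.0:3260"
# Apply the default when the portals parameter is not set; otherwise the
# portal loop in liotarget_start would create nothing.
: ${OCF_RESKEY_portals=${OCF_RESKEY_portals_default}}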
meta_data() {
cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="liotarget" version="${VERSION}">
<version>${VERSION}</version>
<longdesc lang="en">
Manages a LIO iSCSI target and associated logical units.
</longdesc>
<shortdesc lang="en">Manages LIO targets</shortdesc>
<parameters>
<parameter name="iqn" required="1" unique="0">
<longdesc lang="en">
The iSCSI Qualified Name (IQN) for this target.
</longdesc>
<shortdesc lang="en">iSCSI target IQN</shortdesc>
<content type="string" />
</parameter>
<parameter name="portals" required="0" unique="0">
<longdesc lang="en">
iSCSI network portal addresses. If unset, the default is to create one portal
that listens on ${OCF_RESKEY_portals_default}.
</longdesc>
<shortdesc lang="en">iSCSI portal addresses</shortdesc>
<content type="string" default="${OCF_RESKEY_portals_default}"/>
</parameter>
<parameter name="luns" required="1" unique="0">
<longdesc lang="en">
The logical units to create as part of this target. Each logical unit is
specified as a LUN:path pair. Separate multiple logical units with spaces. Use
bash syntax to escape special characters.
</longdesc>
<shortdesc lang="en">Logical Units</shortdesc>
<content type="string" />
</parameter>
</parameters>
<actions>
<action name="start" timeout="10" />
<action name="stop" timeout="10" />
<action name="status" timeout="10" interval="10" depth="0" />
<action name="monitor" timeout="10" interval="10" depth="0" />
<action name="meta-data" timeout="5" />
<action name="validate-all" timeout="10" />
</actions>
</resource-agent>
END
}
liotarget_usage() {
cat <<END
usage: $0 {start|stop|status|monitor|validate-all|meta-data}
Expects to have a fully populated OCF RA-compliant environment set.
END
}
liotarget_start() {
ocf_log info "starting"
liotarget_monitor
if [ $? -eq $OCF_SUCCESS ]; then
# already running. We are good here.
return $OCF_SUCCESS
fi
# lio distinguishes between targets and target portal
# groups (TPGs). We will always create one TPG, with the
# number 1. In lio, creating a network portal
# automatically creates the corresponding target if it
# doesn't already exist.
for portal in ${OCF_RESKEY_portals}; do
ocf_run lio_node --addnp "${OCF_RESKEY_iqn}" 1 "${portal}" || exit $OCF_ERR_GENERIC
done
# lio does per-initiator filtering by default. To disable
# this, we need to switch the target to "permissive mode".
ocf_run lio_node --permissive "${OCF_RESKEY_iqn}" 1 || exit $OCF_ERR_GENERIC
# permissive mode enables read-only access by default,
# so we need to change that to RW to be in line with
# the other implementations.
echo 0 > "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/attrib/demo_mode_write_protect"
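# Compare as a string so that a missing or unreadable attribute also counts as a failure.
if [ "$(cat "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/attrib/demo_mode_write_protect" 2>/dev/null)" != "0" ]; then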
ocf_log err "Failed to disable write protection for target ${OCF_RESKEY_iqn}."
exit $OCF_ERR_GENERIC
fi
# TODO: add CHAP authentication support when it gets added
# back into LIO
ocf_run lio_node --disableauth ${OCF_RESKEY_iqn} 1 || exit $OCF_ERR_GENERIC
# Finally, we need to enable the target to allow
# initiators to connect
ocf_run lio_node --enabletpg=${OCF_RESKEY_iqn} 1 || exit $OCF_ERR_GENERIC
ocf_log info "starting luns"
for lun_and_path in ${OCF_RESKEY_luns}; do
lun=${lun_and_path%%:*}
path=${lun_and_path#*:}
ocf_log info "starting lun ${lun} at path ${path}"
devname="${OCF_RESOURCE_INSTANCE}_${lun}"
iblock="iblock_0/${devname}"
ocf_run tcm_node --createdev="${iblock}" "${path}" || exit $OCF_ERR_GENERIC
ocf_run lio_node --addlun="${OCF_RESKEY_iqn}" 1 "${lun}" \
"${devname}" "${iblock}" || exit $OCF_ERR_GENERIC
done
return $OCF_SUCCESS
}
liotarget_stop() {
ocf_log info "stopping"
ocf_run lio_node --disabletpg=${OCF_RESKEY_iqn} 1
for lun_and_path in ${OCF_RESKEY_luns}; do
lun=${lun_and_path%%:*}
ocf_log info "stopping lun ${lun}"
ocf_run lio_node --dellun="${OCF_RESKEY_iqn}" 1 "${lun}"
done
ocf_run lio_node --deltpg=${OCF_RESKEY_iqn} 1
for lun_and_path in ${OCF_RESKEY_luns}; do
lun=${lun_and_path%%:*}
devname="${OCF_RESOURCE_INSTANCE}_${lun}"
iblock="iblock_0/${devname}"
ocf_log info "freeing iblock ${iblock}"
# ${path} is never set in the stop path, so only the HBA/device name is passed
ocf_run tcm_node --freedev="${iblock}"
done
ocf_run lio_node --deliqn=${OCF_RESKEY_iqn}
liotarget_monitor
if [ $? -eq $OCF_NOT_RUNNING ]; then
return $OCF_SUCCESS
else
return $OCF_ERR_GENERIC
fi
}
liotarget_monitor() {
ocf_log info "monitoring"
# if we have no configfs entry for the target, it's
# definitely stopped
[ -d /sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn} ] || return $OCF_NOT_RUNNING
# if the target is there, but its TPG is not enabled, then
# we also consider it stopped
[ "$(cat "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/enable" 2>/dev/null)" = "1" ] || return $OCF_NOT_RUNNING
return $OCF_SUCCESS
}
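liotarget_validate() {
# iqn and luns are declared required in the metadata, so reject a
# configuration that omits either of them
[ -n "${OCF_RESKEY_iqn}" ] && [ -n "${OCF_RESKEY_luns}" ] || return $OCF_ERR_CONFIGURED
return $OCF_SUCCESS
}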
case $1 in
meta-data)
meta_data
exit $OCF_SUCCESS
;;
usage|help)
liotarget_usage
exit $OCF_SUCCESS
;;
esac
# Everything except usage and meta-data must pass the validate test
liotarget_validate || exit $?
case $__OCF_ACTION in
start) liotarget_start;;
stop) liotarget_stop;;
monitor|status) liotarget_monitor;;
validate-all) ;;
*) liotarget_usage
exit $OCF_ERR_UNIMPLEMENTED
;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc