[Pacemaker] LIO + Pacemaker kernel oops on failover

Phil Frost phil at macprofessionals.com
Fri Jul 13 17:41:23 CEST 2012


On 07/03/2012 02:38 PM, Phil Frost wrote:
> It seems there's something about the iSCSI RAs that hit a bug in LIO:
>
> http://comments.gmane.org/gmane.linux.scsi.target.devel/1568?set_cite=hide 
>
>
> I seem to be hitting the same problem quite reliably whenever I 
> migrate the iSCSI targets in my cluster. Sounds like the OP was able 
> to reach a suitable workaround, but I'm not very experienced with LIO 
> or iSCSI so the discussion is a bit over my head. Anyone have some 
> idea how to implement the changes described there?

I wasn't able to find a way to modify the existing 
iSCSI(Target|LogicalUnit) RAs to stop the target in a way that avoided 
this bug in LIO. The problem was largely that with targets and logical 
units as separate resources, it was difficult to start the target before 
the LUs, and also stop the target before the LUs. I tried using 
asymmetric order constraints, but it didn't work so well in testing. I 
don't know if it's because the shutdown wasn't working cleanly, or if 
the iSCSILogicalUnit resources were upset that the LUs were stopped when 
Pacemaker wasn't expecting it.

Anyhow, my solution was to write a new RA (attached) which managed the 
target and the LUs together, and thus could control the ordering of 
starting and stopping them in detail. It's not as featureful or general 
as the existing RAs, but in my testing so far it is stable.

This is the first RA I have written, so I would appreciate any comments. 
One problem in particular relates to the monitor action -- you can see 
it only checks that the target is running. I could add monitoring for 
the LUs easily enough, but I'm not clear on what should happen if the 
target is up, but the LUs are not. In this state the service is neither 
"up" nor "down", it's broken, and the right thing to do is probably 
attempt to restart it. I'm not sure how I communicate that to Pacemaker 
from my RA, though. Should I return OCF_ERR_GENERIC? What will Pacemaker 
do in this case?
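
A minimal sketch of the LU check described above, returning OCF_ERR_GENERIC 
for the "target up, LUs missing" state. Several things here are assumptions 
and not part of the attached RA: the function name 
liotarget_monitor_with_luns and the CONFIGFS_ISCSI variable are made up for 
illustration, the return codes are hard-coded so the snippet runs on its own 
(a real RA gets them from ocf-shellfuncs), and it assumes LIO exposes each 
mapped LUN as a lun/lun_N directory under the TPG in configfs. Pacemaker's 
documented default for a failed monitor (on-fail=restart) is to stop and 
then start the resource, which would match the "attempt to restart it" 
behaviour wanted here, though that is worth confirming on the version in use.

```shell
#!/bin/bash
# Sketch only: OCF return codes hard-coded so this runs standalone.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical override so the check can be exercised outside a real node.
: "${CONFIGFS_ISCSI:=/sys/kernel/config/target/iscsi}"

# $1 = target IQN, $2 = space-separated "LUN:path" pairs (same format as
# the "luns" parameter of the attached RA).
liotarget_monitor_with_luns() {
    local iqn="$1" luns="$2"
    local tpg="${CONFIGFS_ISCSI}/${iqn}/tpgt_1"
    local enabled lun_and_path lun

    # No configfs entry for the target: cleanly stopped.
    [ -d "${CONFIGFS_ISCSI}/${iqn}" ] || return $OCF_NOT_RUNNING

    # Target exists but its TPG is not enabled: also treated as stopped.
    enabled=$(cat "${tpg}/enable" 2>/dev/null)
    [ "${enabled}" = "1" ] || return $OCF_NOT_RUNNING

    # Target is up. Any missing LUN means the service is broken, not
    # stopped, so report a generic error instead of "not running".
    for lun_and_path in ${luns}; do
        lun=${lun_and_path%%:*}
        [ -d "${tpg}/lun/lun_${lun}" ] || return $OCF_ERR_GENERIC
    done

    return $OCF_SUCCESS
}
```

Inside the RA this would be called as 
liotarget_monitor_with_luns "${OCF_RESKEY_iqn}" "${OCF_RESKEY_luns}". Note 
that the stop action compares the monitor result against OCF_NOT_RUNNING, so 
a half-torn-down target would make stop report failure with this check.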
-------------- next part --------------
#!/bin/bash


: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/resource.d/heartbeat}
. ${OCF_FUNCTIONS_DIR}/.ocf-shellfuncs

VERSION="0.1"
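OCF_RESKEY_portals_default="0.0.0.0:3260"
# Apply the documented default when the portals parameter is not set
: ${OCF_RESKEY_portals=${OCF_RESKEY_portals_default}}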

meta_data() {
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="liotarget" version="${VERSION}">
<version>${VERSION}</version>

<longdesc lang="en">
Manages a LIO iSCSI target and associated logical units.
</longdesc>
<shortdesc lang="en">Manages LIO targets</shortdesc>

<parameters>
    <parameter name="iqn" required="1" unique="0">
        <longdesc lang="en">
The iSCSI Qualified Name (IQN) for this target.
        </longdesc>
        <shortdesc lang="en">iSCSI target IQN</shortdesc>
        <content type="string" />
    </parameter>

    <parameter name="portals" required="0" unique="0">
        <longdesc lang="en">
iSCSI network portal addresses. If unset, the default is to create one portal
that listens on ${OCF_RESKEY_portals_default}.
        </longdesc>
        <shortdesc lang="en">iSCSI portal addresses</shortdesc>
        <content type="string" default="${OCF_RESKEY_portals_default}"/>
    </parameter>

    <parameter name="luns" required="1" unique="0">
        <longdesc lang="en">
The logical units to create as part of this target. Each logical unit is
specified as a LUN:path pair. Separate multiple logical units with spaces. Use
bash syntax to escape special characters.
        </longdesc>
        <shortdesc lang="en">Logical Units</shortdesc>
        <content type="string" />
    </parameter>

</parameters>

<actions>
    <action name="start"        timeout="10" />
    <action name="stop"         timeout="10" />
    <action name="status"       timeout="10" interval="10" depth="0" />
    <action name="monitor"      timeout="10" interval="10" depth="0" />
    <action name="meta-data"    timeout="5" />
    <action name="validate-all" timeout="10" />
</actions>

</resource-agent>
END
}

liotarget_usage() {
        cat <<END
usage: $0 {start|stop|status|monitor|validate-all|meta-data}

Expects to have a fully populated OCF RA-compliant environment set.
END
}

liotarget_start() {
    ocf_log info "starting"

    liotarget_monitor
    if [ $? -eq $OCF_SUCCESS ]; then
        # already running. We are good here.
        return $OCF_SUCCESS
    fi

    # lio distinguishes between targets and target portal
    # groups (TPGs). We will always create one TPG, with the
    # number 1. In lio, creating a network portal
    # automatically creates the corresponding target if it
    # doesn't already exist.
    for portal in ${OCF_RESKEY_portals}; do
        ocf_run lio_node --addnp "${OCF_RESKEY_iqn}" 1 "${portal}" || exit $OCF_ERR_GENERIC
    done
    # lio does per-initiator filtering by default. To disable
    # this, we need to switch the target to "permissive mode".
    ocf_run lio_node --permissive "${OCF_RESKEY_iqn}" 1 || exit $OCF_ERR_GENERIC
    # permissive mode enables read-only access by default,
    # so we need to change that to RW to be in line with
    # the other implementations.
    echo 0 > "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/attrib/demo_mode_write_protect"
    if [ "$(cat "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/attrib/demo_mode_write_protect" 2>/dev/null)" != "0" ]; then
        ocf_log err "Failed to disable write protection for target ${OCF_RESKEY_iqn}."
        exit $OCF_ERR_GENERIC
    fi
    # TODO: add CHAP authentication support when it gets added
    # back into LIO
    ocf_run lio_node --disableauth ${OCF_RESKEY_iqn} 1 || exit $OCF_ERR_GENERIC
    # Finally, we need to enable the target to allow
    # initiators to connect
    ocf_run lio_node --enabletpg=${OCF_RESKEY_iqn} 1 || exit $OCF_ERR_GENERIC


    ocf_log info "starting luns"
    for lun_and_path in ${OCF_RESKEY_luns}; do
        lun=${lun_and_path%%:*}
        path=${lun_and_path#*:}
        ocf_log info "starting lun ${lun} at path ${path}"

        devname="${OCF_RESOURCE_INSTANCE}_${lun}"
        iblock="iblock_0/${devname}"

        ocf_run tcm_node --createdev="${iblock}" "${path}" || exit $OCF_ERR_GENERIC
        ocf_run lio_node --addlun="${OCF_RESKEY_iqn}" 1 "${lun}" \
            "${devname}" "${iblock}" || exit $OCF_ERR_GENERIC
    done

    return $OCF_SUCCESS
}

liotarget_stop() {
    ocf_log info "stopping"

    ocf_run lio_node --disabletpg=${OCF_RESKEY_iqn} 1

    for lun_and_path in ${OCF_RESKEY_luns}; do
        lun=${lun_and_path%%:*}
        ocf_log info "stopping lun ${lun}"

        ocf_run lio_node --dellun="${OCF_RESKEY_iqn}" 1 "${lun}"
    done

    ocf_run lio_node --deltpg=${OCF_RESKEY_iqn} 1

    for lun_and_path in ${OCF_RESKEY_luns}; do
        lun=${lun_and_path%%:*}
        devname="${OCF_RESOURCE_INSTANCE}_${lun}"
        iblock="iblock_0/${devname}"
        ocf_log info "freeing iblock ${iblock}"

        ocf_run tcm_node --freedev="${iblock}"
    done

    ocf_run lio_node --deliqn=${OCF_RESKEY_iqn}

    liotarget_monitor
    if [ $? -eq $OCF_NOT_RUNNING ]; then
        return $OCF_SUCCESS
    else
        return $OCF_ERR_GENERIC
    fi
}

liotarget_monitor() {
    ocf_log info "monitoring"

    # if we have no configfs entry for the target, it's
    # definitely stopped
    [ -d /sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn} ] || return $OCF_NOT_RUNNING
    # if the target is there, but its TPG is not enabled, then
    # we also consider it stopped
    [ "$(cat "/sys/kernel/config/target/iscsi/${OCF_RESKEY_iqn}/tpgt_1/enable" 2>/dev/null)" = "1" ] || return $OCF_NOT_RUNNING
    return $OCF_SUCCESS
}

liotarget_validate() {
    return $OCF_SUCCESS
}


case $1 in
  meta-data)
        meta_data
        exit $OCF_SUCCESS
        ;;
  usage|help)
        liotarget_usage
        exit $OCF_SUCCESS
        ;;
esac

# Everything except usage and meta-data must pass the validate test
liotarget_validate

case $__OCF_ACTION in
start)          liotarget_start;;
stop)           liotarget_stop;;
monitor|status) liotarget_monitor;;
validate-all)   ;;
*)              liotarget_usage
                exit $OCF_ERR_UNIMPLEMENTED
                ;;
esac
rc=$?
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"
exit $rc

