[Pacemaker] "rc=-2" when executing monitors

Wed Aug 26 11:15:42 UTC 2009

Hi,

	we've recently deployed a two-node cluster using pacemaker, and we're
seeing something strange in the logs: from time to time, a monitor operation
fails with "rc=-2". Here are two examples:

crmd: [5506]: info: process_lrm_event: LRM operation rinetd_monitor_60000
(call=427, rc=-2, cib-update=0, confirmed=true) Cancelled unknown exec error

crmd: [5506]: info: process_lrm_event: LRM operation rinetd_monitor_60000
(call=464, rc=-2, cib-update=0, confirmed=true) Cancelled unknown exec error

	This happens not only with LSB resources but also with OCF ones. In
both cases we wrote the scripts ourselves, so we may be doing something wrong.
In one case, we use "ps" and "grep" to find the process we're monitoring:

        ps ax | grep -v grep | egrep "ftp-proxy .* ${OCF_RESKEY_config}" >/dev/null 2>/dev/null
        if [ $? -eq 0 ];then
                return $OCF_SUCCESS
        else
                return $OCF_NOT_RUNNING
        fi

	This is, as you may guess, an OCF script for ftp-proxy. I know that we
should return a different error code instead of just OCF_NOT_RUNNING for
every non-zero result, and we've fixed that already. But I wanted to show you
the script exactly as it was when the failures occurred, in case it is
related. In fact, returning OCF_NOT_RUNNING (rc=7, IIRC) for every non-zero
result makes it even stranger that there is a rc=-2 in the logs.
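
	To illustrate what "a different error" could mean here, this is a
minimal sketch and not our exact fix. It assumes the procps "pgrep" is
available (exit status 0 for a match, 1 for no match, higher values on
errors). The function name is made up for this mail, and the OCF_* defaults
are only set so the snippet can run outside the cluster:

```shell
# Standard OCF return codes; normally provided by the OCF shell function
# library. Defaults are set here only so the sketch runs standalone.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_GENERIC:=1}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical monitor function (the name is illustrative, not from our agent).
ftp_proxy_monitor() {
    rc=0
    # pgrep -f matches the pattern against the full command line.
    pgrep -f "ftp-proxy .* ${OCF_RESKEY_config}" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) return "$OCF_SUCCESS" ;;      # process found
        1) return "$OCF_NOT_RUNNING" ;;  # no matching process
        *) return "$OCF_ERR_GENERIC" ;;  # pgrep itself failed
    esac
}
```

Even with this mapping, the function can only return 0, 1 or 7, so it would
not explain a rc=-2 either.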

	And this is the monitor snippet for an LSB script for rinetd, the one
whose errors I pasted above:

        pidof rinetd > /dev/null
        R=$?
        case "$R" in
                0)
                        echo "rinetd is running (PID `pidof 'rinetd'`)"
                        exit 0
                        ;;
                *)
                        echo "rinetd is NOT running (rc=$R)"
                        exit 3
                        ;;
        esac

	As you can see, here too there is no way to return anything other
than 0 or 3.

	What could be causing these strange failures? If more information or
testing is needed, please ask.

	Thanks in advance.

-- 
        Roberto Suarez Soto                             Allenta Consulting
        robe at allenta.com                                   www.allenta.com
                                                           +34 881 922 600