[ClusterLabs] Approach to validate on stop op (Was Re: crmsh configure delete for constraints)
Vladislav Bogdanov
bubble at hoster-ok.com
Tue Mar 29 12:28:39 UTC 2016
10.02.2016 12:31, Vladislav Bogdanov wrote:
> 10.02.2016 11:38, Ulrich Windl wrote:
>>>>> Vladislav Bogdanov <bubble at hoster-ok.com> wrote on 10.02.2016 at
>>>>> 05:39 in
>> message <6E479808-6362-4932-B2C6-348C7EFC4020 at hoster-ok.com>:
>>
>> [...]
>>> Well, I'd reword. Generally, an RA should not exit with an error if
>>> validation fails on stop.
>>> Is that better?
>> [...]
>>
>> As we have different error codes, what type of error?
>
> Any that makes pacemaker think the resource stop op failed,
> OCF_ERR_* in particular.
>
> If pacemaker gets an error on start, it will run stop with the same
> set of parameters anyway. It will then get the error again if that error
> came from validation and the RA does not differentiate between validation
> for start and for stop. Circular fencing over the whole cluster is then
> triggered for no reason.
>
> Of course, for safety, the RA could save its state if start was
> successful and skip validation on stop only if that state is not found.
> Otherwise a removed binary or config file would result in the resource
> running on several nodes.
>
> Well, this all seems too complicated to turn into a general
> algorithm ;)
Well, after some thinking, I've come up with an approach which sounds both
elegant and safe enough to me and my colleagues. Please look at the
following excerpt (part of a hypothetical RA, before the main 'case'):
-----
VALIDATION_FAILURE_FLAG="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid"
case "${__OCF_ACTION}" in
    meta-data)
        meta_data
        exit $OCF_SUCCESS
        ;;
    usage|help)
        usage
        exit $OCF_SUCCESS
        ;;
    start)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            # Remember that start failed validation, so stop may succeed.
            touch "${VALIDATION_FAILURE_FLAG}"
            exit ${ret}
        fi
        ;;
    stop)
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if [ -f "${VALIDATION_FAILURE_FLAG}" ] ; then
                # Start never passed validation, so nothing is running.
                rm -f "${VALIDATION_FAILURE_FLAG}"
                exit $OCF_SUCCESS
            else
                exit ${ret}
            fi
        fi
        ;;
    *) # monitor | notify | reload | etc
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ] ; then
            if ocf_is_probe ; then
                exit $OCF_NOT_RUNNING
            fi
            # Use the saved code: $? here is the status of the 'if' above.
            exit ${ret}
        fi
        ;;
esac
-----
The above assumes that the validation function does not call exit (and thus
uses have_binary instead of check_binary, etc.) but returns an error code.
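For illustration, a validation function of that kind might look roughly like
the sketch below. The parameter names OCF_RESKEY_binary and OCF_RESKEY_config
are hypothetical, and the return-code values and the have_binary stub are
stand-ins so that the fragment runs outside a cluster node; a real RA would
take them from ocf-shellfuncs.

```shell
# Stand-ins for definitions normally sourced from ocf-shellfuncs.
: "${OCF_SUCCESS:=0}"
: "${OCF_ERR_INSTALLED:=5}"
have_binary() { command -v "$1" >/dev/null 2>&1; }

# Hypothetical resource parameters, for illustration only.
: "${OCF_RESKEY_binary:=/bin/sh}"
: "${OCF_RESKEY_config:=/etc/hosts}"

validate() {
    # Report a missing binary through the return code; never call exit here.
    have_binary "${OCF_RESKEY_binary}" || return $OCF_ERR_INSTALLED
    # Same for a missing or unreadable config file.
    [ -r "${OCF_RESKEY_config}" ] || return $OCF_ERR_INSTALLED
    return $OCF_SUCCESS
}
```

Because validate only returns, the caller can decide per action what a
validation failure should mean.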
The main difference from the current ocf_rarun implementation is that
changes to the machine environment (deleted binaries, configs, etc.) still
result in a stop failure (and thus fencing) if those changes were made
after a successful validation on resource start.
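To make the two cases concrete, here is a small simulation of the start and
stop pre-checks that can be run outside a cluster. Everything in it is a
stand-in: the temporary directory replaces HA_RSCTMP, DEMO_BINARY is a
hypothetical dependency, precheck is a helper invented for this sketch, and
"exit 0" stands for falling through to the RA's main 'case'.

```shell
# Stand-ins for values normally provided by ocf-shellfuncs and the cluster.
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5
HA_RSCTMP=$(mktemp -d)
OCF_RESOURCE_INSTANCE=demo
VALIDATION_FAILURE_FLAG="${HA_RSCTMP}/${OCF_RESOURCE_INSTANCE}.invalid"
DEMO_BINARY="${HA_RSCTMP}/demo-binary"   # hypothetical dependency

validate() {
    [ -e "${DEMO_BINARY}" ] || return $OCF_ERR_INSTALLED
    return $OCF_SUCCESS
}

# Run the pre-check for one action in a subshell and print its exit code.
precheck() {
    (
        validate
        ret=$?
        if [ ${ret} -ne $OCF_SUCCESS ]; then
            case "$1" in
            start)
                touch "${VALIDATION_FAILURE_FLAG}"
                exit ${ret}
                ;;
            stop)
                if [ -f "${VALIDATION_FAILURE_FLAG}" ]; then
                    rm -f "${VALIDATION_FAILURE_FLAG}"
                    exit $OCF_SUCCESS
                fi
                exit ${ret}
                ;;
            esac
        fi
        exit $OCF_SUCCESS   # stands for continuing to the main 'case'
    )
    echo "$1 -> $?"
}

# Case 1: the binary was never installed; start fails, stop succeeds.
case1=$(precheck start; precheck stop)
# Case 2: the binary disappears after a successful start; stop fails.
touch "${DEMO_BINARY}"
case2=$(precheck start; rm -f "${DEMO_BINARY}"; precheck stop)
printf '%s\n%s\n' "$case1" "$case2"
```

In the first case stop reports success, so no fencing is needed; in the
second case stop reports the validation error, which is the situation where
fencing is still wanted.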
I plan to test this approach extensively in my RAs shortly.
Comments are welcome.
Best,
Vladislav
>
>
>>
>> Regards,
>> Ulrich
>>
>>
>>
>> _______________________________________________
>> Users mailing list: Users at clusterlabs.org
>> http://clusterlabs.org/mailman/listinfo/users
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>