[Pacemaker] Why monitor fails in my RA
Greg
test123 at implix.com
Fri Apr 27 14:18:38 UTC 2012
On 04/27/12 03:43, Andrew Beekhof wrote:
[cut]
>> My monitor function (simplified, with overhead removed and some
>> comments added) is:
>> redis_monitor() {
>> # I set score 10 for master 5 is for slave
>> CURSCORE=`$CRM_MASTER -G -q`
>> logger "redis_monitor: score $CURSCORE"
>> local state
>> redis_state
>>
>> # In RET is current local redis state
>> state=$(echo "${RET}" | cut -d':' -f2 | tr -d '\r')
>>
>> if [ "${state}" = "master" ];then
>> $CRM_MASTER -v $CRM_MASTER_SCORE # score is 10
>> exit $OCF_RUNNING_MASTER
>> fi
>>
>> if [ "${state}" = "slave" ];then
>> $CRM_MASTER -v $CRM_SLAVE_SCORE # score is 5
>> exit $OCF_SUCCESS
>> fi
>>
>> # if not slave/master so resource is failed
>> $CRM_MASTER -l reboot -D
>> if [ $CURSCORE -eq $CRM_MASTER_SCORE ];then
>> exit $OCF_FAILED_MASTER
>> fi
>>
>> exit $OCF_NOT_RUNNING
>
> Are you sure it's NOT_RUNNING?
> Could it also be running but generally failed?
The redis_state function sets the RET variable: master when redis is a
master, slave when it is a slave, and empty when redis is not running or
is hanging. If redis is neither master nor slave, it should be restarted.
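
The state extraction used in the monitor can be sketched as a small
standalone helper. This is only an illustration: parse_role is a
hypothetical name, and it assumes the text being parsed contains a line
like role:master terminated by CR/LF, which is what the cut/tr pipeline
in the monitor suggests. How that text is fetched from redis is left out.

```shell
# Hypothetical helper (not part of the RA): read INFO-style text on stdin
# and print the role, i.e. "master", "slave", or nothing if no role line
# is present (redis not running or hanging).
parse_role() {
    # Keep the first "role:" line, take the value after the colon,
    # and drop the trailing carriage return.
    grep '^role:' | head -n 1 | cut -d':' -f2 | tr -d '\r'
}

# Example with made-up sample input:
state=$(printf 'connected_clients:1\r\nrole:master\r\n' | parse_role)
echo "state=${state}"
```

An empty result is the case the monitor treats as "neither master nor
slave".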
>> From my logs I know that monitoring function returned OCF_FAILED_MASTER when
>> master is down and then this error occurred:
>> redis-server:0_monitor_5000 (node=s1, call=16, rc=9, status=complete):
>> master (failed)
>>
>> After that failed master node is not monitored on that node until I run
>> cleanup:
>> #crm resource cleanup redis-server:0
>>
>>
>> My questions:
>> 1) What am I doing wrong? How can I fix this?
>> I've tried on-fail="restart" but this did not help.
>
> You'd need to supply more information (in the form of a hb_report tarball).
> An upgrade might not hurt either.
No new version for Debian in squeeze-backports :(
Report attached. I've changed the monitor a little bit; it now checks the
state using OCF_RESKEY_CRM_meta_role, but I still have the same problem.
The test scenario: the master is running on s1, and after a while I kill
redis on s1. S2 becomes master, and after almost 2 minutes I do the same
on s2: kill the redis process. Redis on S1 becomes master (it's all in
the report).
After killing redis on S2 this error occurs:
redis-server:0_monitor_5000 (node=s1, call=16, rc=9, status=complete):
master (failed)
Now, if S2 becomes master, redis on that node is never monitored again
(when S2 is a slave for redis). It's very strange that this error never
happens when I kill redis for the first time on S1.
redis_monitor() {
    local CURSTATE
    local state
    # One can use (undocumented ?)
    #OCF_RESKEY_CRM_meta_role=Slave
    #OCF_RESKEY_CRM_meta_role=Master
    CURSTATE=$(echo "${OCF_RESKEY_CRM_meta_role}" | tr 'A-Z' 'a-z')
    logger "redis_monitor: current state: $CURSTATE"
    # check redis state
    redis_state
    state=$(echo "${RET}" | cut -d':' -f2 | tr -d '\r')
    logger "redis_monitor: redis state $state"
    # CRM says redis is master:
    if [ "${CURSTATE}" = "master" ]; then
        if [ "${state}" = "master" ]; then
            logger "redis_monitor 1 $OCF_RUNNING_MASTER"
            $CRM_MASTER -v $CRM_MASTER_SCORE
            exit $OCF_RUNNING_MASTER
        else
            logger "redis_monitor: CRM says master but redis says other thing"
            $CRM_MASTER -D
            exit $OCF_FAILED_MASTER
        fi
    fi
    # CRM says redis is slave:
    if [ "${CURSTATE}" = "slave" ]; then
        if [ "${state}" = "slave" ]; then
            logger "redis_monitor 2 $OCF_SUCCESS"
            # TODO: in the future, add extra tests, e.g. write/read a key,
            # check whether replication works, etc.
            $CRM_MASTER -v $CRM_SLAVE_SCORE
            exit $OCF_SUCCESS
        else
            logger "redis_monitor: CRM says slave but redis says other thing"
            $CRM_MASTER -D
            exit $OCF_NOT_RUNNING
        fi
    fi
    # State not defined (not in master-slave state)
    if [ "${CURSTATE}" = "" ]; then
        if [ "${state}" = "" ]; then
            logger "redis_monitor pre-end $OCF_NOT_RUNNING"
            $CRM_MASTER -D
            exit $OCF_NOT_RUNNING
        else
            logger "redis_monitor pre-end $OCF_SUCCESS"
            $CRM_MASTER -v $CRM_SLAVE_SCORE
            exit $OCF_SUCCESS
        fi
    fi
    # It's impossible to get here but safe to keep it
    $CRM_MASTER -D
    logger "redis_monitor end $OCF_NOT_RUNNING"
    exit $OCF_NOT_RUNNING
}
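
To make the decision logic of the monitor easier to check, here it is as
a rough sketch of a side-effect-free function. monitor_rc is a
hypothetical name; it only prints the return code for a pair (role the
CRM expects, role redis reports) and does not touch the master score.
The numeric values are the standard OCF return codes, defined here only
so the sketch runs on its own; in the RA they come from the OCF shell
functions.

```shell
# Standard OCF return codes (normally provided by the OCF shell functions).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7
OCF_RUNNING_MASTER=8
OCF_FAILED_MASTER=9

# Hypothetical helper: print the monitor return code for
#   $1 = role the CRM expects (lowercase, may be empty)
#   $2 = role redis reports   (lowercase, may be empty)
monitor_rc() {
    crm_role=$1
    redis_role=$2
    case "${crm_role}:${redis_role}" in
        master:master) echo "$OCF_RUNNING_MASTER" ;;  # master as expected
        master:*)      echo "$OCF_FAILED_MASTER" ;;   # expected master, got something else
        slave:slave)   echo "$OCF_SUCCESS" ;;         # slave as expected
        slave:*)       echo "$OCF_NOT_RUNNING" ;;     # expected slave, got something else
        :)             echo "$OCF_NOT_RUNNING" ;;     # no expected role, redis reports nothing
        :*)            echo "$OCF_SUCCESS" ;;         # no expected role, redis reports a role
        *)             echo "$OCF_NOT_RUNNING" ;;     # any other expected role
    esac
}

# Example: the CRM expects a master but redis does not answer.
monitor_rc master ""
```

The second branch is the one that produces rc=9 in the failed action
shown above.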
>
>>
>> 2) With an older version of redis (2.3), if the master fails, redis hangs
>> for some time (21-24 seconds). Even if I set a higher timeout on the
>> monitor function, it still times out after 20 seconds. Why?
>
> How did you set the timeout higher?
>
By setting:
default-action-timeout="60s"
I think that monitor timeout should be sufficient, but the operation was
still stopped after 20 seconds. The log looks like this:
Apr 26 16:25:37 SREVERXXX lrmd: [18777]: debug: on_msg_perform_op: add
an operation operation monitor[3] on ocf::redis::redis-serv:0 for client
18780, its parameters: vservers=[redis-2,redis-1]
CRM_meta_master_max=[1] CRM_meta_timeout=[20000] CRM_meta_clone_max=[2]
CRM_meta_master_node_max=[1] crm_feature_set=[3.0.1]
CRM_meta_globally_unique=[false] masterip=[X.X.X.X] CRM_meta_clone=[0]
CRM_meta_clone_node_max=[1] CRM_meta_notify=[false] to the operation list.
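
The timeout that was passed with the operation can be read straight from
such a log line. A minimal sketch follows; op_timeout_ms is a
hypothetical helper name, and the sample input is a shortened copy of the
line above. In this log the value is 20000 ms, i.e. 20 seconds rather
than the 60 seconds I set.

```shell
# Hypothetical helper: read lrmd log text on stdin and print the first
# CRM_meta_timeout value (in milliseconds), or nothing if it is absent.
op_timeout_ms() {
    sed -n 's/.*CRM_meta_timeout=\[\([0-9][0-9]*\)\].*/\1/p' | head -n 1
}

# Example with a shortened copy of the log line:
ms=$(printf '%s\n' 'CRM_meta_master_max=[1] CRM_meta_timeout=[20000] CRM_meta_clone_max=[2]' | op_timeout_ms)
echo "monitor timeout: $((ms / 1000))s"
```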
Another question: what value should the demote function return if the
node (master) is down? I return OCF_NOT_RUNNING and get this failure:
Failed actions:
redis-server:0_demote_0 (node=s1, call=61, rc=7, status=complete):
not running
--
Greg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report2.tar.bz2
Type: application/x-bzip2
Size: 65281 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20120427/64ee1c91/attachment-0004.bz2>