[Pacemaker] Squid ocf agent. Temporaly unknown error stopping squid daemon.
Mauricio Esteban
mauriciomem at gmail.com
Tue May 7 22:56:26 UTC 2013
Hello,
I'm working on a ACTIVE/PASSIVE proxy cluster on debian 6.0.7. The cluster
has two resources configured, an OCF IPaddr2 agent and a squid OCF agent.
The IPaddr2 agent is working fine, but the squid agent shows a temporary
error when I manually stop the service (/etc/init.d/squid stop) in the node
with the
running resource.
If I stop the resource, restart the server o put on standby the node
running the squid agent, the service stops fine and passes it to the other
node.
The only issue appears during the stop of the service.
The servers (debian 6.0.7) kernel and packages version:
uname -r
2.6.32-5-amd64
pacemaker 1.0.9.1+hg15626-1
heartbeat 1:3.0.3-2
libheartbeat2 1:3.0.3-2
Pacemaker configuration:
crm configure show
node $id="433a22e8-9620-4889-b407-47125a40d4ae" proxyfailovernoc \
attributes standby="off"
node $id="dfb4a72b-24bb-4410-809d-514952f68e76" proxynoc \
attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
params ip="10.5.15.42" cidr_netmask="32" nic="eth0" \
op monitor interval="10s"
primitive ClusterSquid ocf:heartbeat:Squid \
params squid_exe="/usr/sbin/squid"
squid_conf="/etc/squid/squid.conf" squid_pidfile="/var/run/squid.pid"
squid_port="80" squid_stop_timeout="30" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="120s" \
op monitor interval="2s" timeout="30s" \
meta target-role="Started" failure-timeout="30s"
property $id="cib-bootstrap-options" \
dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
cluster-infrastructure="Heartbeat" \
stonith-enabled="false" \
last-lrm-refresh="1367960908" \
expected-quorum-votes="2"
This is my pacemaker status before stopping squid:
crm_mon -1ro
============
Last updated: Tue May 7 18:08:36 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) -
partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ proxyfailovernoc proxynoc ]
Full list of resources:
ClusterIP (ocf::heartbeat:IPaddr2): Started proxyfailovernoc
ClusterSquid (ocf::heartbeat:Squid): Started proxynoc
Operations:
* Node proxynoc:
ClusterIP: migration-threshold=1000000
+ (6) start: rc=0 (ok)
+ (7) monitor: interval=10000ms rc=0 (ok)
+ (9) stop: rc=0 (ok)
ClusterSquid: migration-threshold=1000000
+ (12) monitor: interval=2000ms rc=0 (ok)
* Node proxyfailovernoc:
ClusterIP: migration-threshold=1000000
+ (10) stop: rc=0 (ok)
+ (11) start: rc=0 (ok)
+ (12) monitor: interval=10000ms rc=0 (ok)
crm_verify -LVVV
crm_verify[5411]: 2013/05/07_18:10:43 info: main: =#=#=#=#= Getting XML
=#=#=#=#=
crm_verify[5411]: 2013/05/07_18:10:43 info: main: Reading XML from: live
cluster
crm_verify[5411]: 2013/05/07_18:10:43 info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node
proxynoc is online
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node
proxyfailovernoc is online
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print: ClusterIP
(ocf::heartbeat:IPaddr2): Started proxyfailovernoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print:
ClusterSquid (ocf::heartbeat:Squid): Started proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: RecurringOp: Start recurring
monitor (2s) for ClusterSquid on proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource
ClusterIP (Started proxyfailovernoc)
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource
ClusterSquid (Started proxynoc)
Squid status before stopping daemon
netstat -anpt
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address
State PID/Program name
tcp 0 0 0.0.0.0:80 0.0.0.0:*
LISTEN 2068/(squid)
cat /var/run/squid.pid
2068
ps auxf | grep squid
root 2063 0.0 0.0 21832 660 ? Ss 18:00 0:00
/usr/sbin/squid -f /etc/squid/squid.conf
proxy 2068 0.0 0.1 29864 6408 ? S 18:00 0:00 \_
(squid) -f /etc/squid/squid.conf
At the moment I run /etc/init.d/squid stop, appears in the HA log:
tail -f /var/log/ha.log
Squid[7405]: 2013/05/07_18:17:36 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]: 2013/05/07_18:17:37 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]: 2013/05/07_18:17:38 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]: 2013/05/07_18:17:39 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]: 2013/05/07_18:17:39 ERROR: squid:Inconsistency of processes
remains unsolved
May 07 18:17:39 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=12, rc=1, cib-update=25,
confirmed=false)
unknon error
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update
relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_local_callback:
Expanded fail-count-ClusterSquid=value++ to 1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-ClusterSquid (1)
May 07 18:17:41 proxynoc lrmd: [1423]: info: cancel_op: operation
monitor[12] on ocf::Squid::ClusterSquid for client 1426, its parameters:
squid_exe=[/usr/sbn/squid] squid_port=[80] crm_feature_set=[3.0.1]
squid_stop_timeout=[30] squid_pidfile=[/var/run/squid.pid]
CRM_meta_name=[monitor]
squid_conf=[/etc/squid/sqid.conf] CRM_meta_timeout=[30000]
CRM_meta_interval=[2000] cancelled
May 07 18:17:41 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=2:9:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_stop_0 )
May 07 18:17:41 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:13: stop
May 07 18:17:41 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=12, status=1, cib-update=0,
confirmed=true)
Canelled
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent
update 17: fail-count-ClusterSquid=1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update
relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: find_hash_entry: Creating
hash entry for last-failure-ClusterSquid
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-ClusterSquid (1367961460)
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent
update 20: last-failure-ClusterSquid=1367961460
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) ls:
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr)
Squid[7471]: 2013/05/07_18:17:42 INFO: squid:stop_squid:315: stop NORM
1/30
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) ls:
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr)
May 07 18:17:42 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_stop_0 (call=13, rc=0, cib-update=26,
confirmed=true) ok
May 07 18:17:43 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=8:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_start_0 )
May 07 18:17:43 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:14: start
Squid[7504]: 2013/05/07_18:17:43 INFO: squid:Waiting for squid to be
invoked
Squid[7504]: 2013/05/07_18:17:45 INFO: squid:Waiting for squid to be
invoked
May 07 18:17:46 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_start_0 (call=14, rc=0, cib-update=27,
confirmed=true) ok
May 07 18:17:47 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=9:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c
op=ClusterSquid_monitor_2000 )
May 07 18:17:47 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:15: monitor
May 07 18:17:47 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=15, rc=0, cib-update=28,
confirmed=false) ok
This status is only shown between 2 to 4 seconds and everything becomes as
before:
crm_mon
============
Last updated: Tue May 7 18:19:22 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) -
partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============
Online: [ proxyfailovernoc proxynoc ]
ClusterIP (ocf::heartbeat:IPaddr2): Started proxyfailovernoc
ClusterSquid (ocf::heartbeat:Squid): Started proxynoc FAILED
Failed actions:
ClusterSquid_monitor_2000 (node=proxynoc, call=15, rc=1,
status=complete): unknown error
The logs shows at the beginning some inconsistent pid numbers, and a
standard error thrown by de RA output. Searching in the squid ocf script
(/usr/lib/ocf/resource.d/heartbeat/Squid), the function get_pids() take the
squid daemon pid from different sources.
If I execute the three commands it shows:
# Seek by pattern
SQUID_PIDS[0]= 8205
# Seek by pidfile
SQUID_PIDS[1]= 8207
# Seek by port
SQUID_PIDS[2]= 8207
At first glance, the problem seems to be at the search through the pattern.
If i change the command store in the variable, I can correct the value:
#SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")
SQUID_PIDS[0]=$(pgrep -f "\(squid\) -f /etc/squid/squid.conf")
with this change all searches find the same ips
get_pids()
{
SQUID_PIDS=()
# Seek by pattern
SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")
# Seek by pidfile
SQUID_PIDS[1]=$(awk '1{print $1}' $SQUID_PIDFILE 2>/dev/null)
if [[ -n "${SQUID_PIDS[1]}" ]]; then
typeset exe
exe=$(ls -l "/proc/${SQUID_PIDS[1]}/exe")
if [[ $? = 0 ]]; then
exe=${exe##*-> }
if ! [[ "$exe" = $SQUID_EXE ]]; then
SQUID_PIDS[1]=""
fi
else
SQUID_PIDS[1]=""
fi
fi
# Seek by port
SQUID_PIDS[2]=$(
netstat -apn |
awk '/tcp.*[0-9]+\.[0-9]+\.+[0-9]+\.[0-9]+:'$SQUID_PORT' /{
sub("\\/.*", "", $7); print $7; exit}')
}
But the problem remains still:
Squid[18840]: 2013/05/07_18:55:23 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]: 2013/05/07_18:55:24 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]: 2013/05/07_18:55:25 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]: 2013/05/07_18:55:26 INFO: squid:Inconsistent processes:
8207,8207,
In this case, it shows the same process id. The standard error continues to
appear.
I think this shows that the process monitor_squid() doesn't save all the
pids from get_pids() at the moment I stop the daemon. If echoed the output
the get_pid function inside the while loop, it only shows two of the three
pids searches.
monitor_squid()
{
typeset trialcount=0
while true; do
get_pids
if are_all_pids_found; then
are_pids_sane
return $OCF_SUCCESS
fi
if is_squid_dead; then
return $OCF_NOT_RUNNING
fi
ocf_log info "$SQUID_NAME:Inconsistent processes:" \
"${SQUID_PIDS[0]},${SQUID_PIDS[1]},${SQUID_PIDS[2]}"
(( trialcount = trialcount + 1 ))
if (( trialcount > SQUID_CONFIRM_TRIALCOUNT )); then
ocf_log err "$SQUID_NAME:Inconsistency of processes
remains unsolved"
return $OCF_ERR_GENERIC
fi
sleep 1
done
}
I don't know how to correct this issue. Every time this process happen my
failure count goes up.
I cant find a solution, I don't know if the process should stay in the
array value at the moment I stop the daemon, but it shows an error.
Tell me if you need more information.
Thanks in advance
Mauricio Esteban.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20130507/b3ee6393/attachment-0003.html>
More information about the Pacemaker
mailing list