<div dir="ltr">Hello,<br><br>I'm working on a ACTIVE/PASSIVE proxy cluster on debian 6.0.7. The cluster has two resources configured, an OCF IPaddr2 agent and a squid OCF agent.<br> The IPaddr2 agent is working fine, but the squid agent shows a temporary error when I manually stop the service (/etc/init.d/squid stop) in the node with the <br>
<br>running resource.<br><br> If I stop the resource, restart the server o put on standby the node running the squid agent, the service stops fine and passes it to the other node.<br>The only issue appears during the stop of the service.<br>
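For reference, these are the cluster-side operations that do work cleanly (standard pacemaker 1.0 crm shell commands; the resource and node names are the ones from my configuration below):

    # stopping/starting the resource through the cluster: no errors
    crm resource stop ClusterSquid
    crm resource start ClusterSquid

    # putting the node that runs the Squid agent on standby:
    # the resource fails over to the other node without complaint
    crm node standby proxynoc
    crm node online proxynoc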
Kernel and package versions on the servers (Debian 6.0.7):

uname -r
2.6.32-5-amd64

pacemaker 1.0.9.1+hg15626-1
heartbeat 1:3.0.3-2
libheartbeat2 1:3.0.3-2

Pacemaker configuration:
crm configure show
node $id="433a22e8-9620-4889-b407-47125a40d4ae" proxyfailovernoc \
        attributes standby="off"
node $id="dfb4a72b-24bb-4410-809d-514952f68e76" proxynoc \
        attributes standby="off"
primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.5.15.42" cidr_netmask="32" nic="eth0" \
        op monitor interval="10s"
primitive ClusterSquid ocf:heartbeat:Squid \
        params squid_exe="/usr/sbin/squid" squid_conf="/etc/squid/squid.conf" squid_pidfile="/var/run/squid.pid" squid_port="80" squid_stop_timeout="30" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="2s" timeout="30s" \
        meta target-role="Started" failure-timeout="30s"
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        last-lrm-refresh="1367960908" \
        expected-quorum-votes="2"

This is my pacemaker status before stopping squid:

crm_mon -1ro
============
Last updated: Tue May 7 18:08:36 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ proxyfailovernoc proxynoc ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started proxyfailovernoc
 ClusterSquid   (ocf::heartbeat:Squid):         Started proxynoc

Operations:
* Node proxynoc:
   ClusterIP: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=10000ms rc=0 (ok)
    + (9) stop: rc=0 (ok)
   ClusterSquid: migration-threshold=1000000
    + (12) monitor: interval=2000ms rc=0 (ok)
* Node proxyfailovernoc:
   ClusterIP: migration-threshold=1000000
    + (10) stop: rc=0 (ok)
    + (11) start: rc=0 (ok)
    + (12) monitor: interval=10000ms rc=0 (ok)
crm_verify -LVVV
crm_verify[5411]: 2013/05/07_18:10:43 info: main: =#=#=#=#= Getting XML =#=#=#=#=
crm_verify[5411]: 2013/05/07_18:10:43 info: main: Reading XML from: live cluster
crm_verify[5411]: 2013/05/07_18:10:43 info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node proxynoc is online
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node proxyfailovernoc is online
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print: ClusterIP (ocf::heartbeat:IPaddr2): Started proxyfailovernoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print: ClusterSquid (ocf::heartbeat:Squid): Started proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: RecurringOp: Start recurring monitor (2s) for ClusterSquid on proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource ClusterIP (Started proxyfailovernoc)
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource ClusterSquid (Started proxynoc)
Squid status before stopping the daemon:

netstat -anpt
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address      Foreign Address    State      PID/Program name
tcp        0      0 0.0.0.0:80         0.0.0.0:*          LISTEN     2068/(squid)

cat /var/run/squid.pid
2068

ps auxf | grep squid
root      2063  0.0  0.0  21832   660 ?   Ss  18:00  0:00 /usr/sbin/squid -f /etc/squid/squid.conf
proxy     2068  0.0  0.1  29864  6408 ?   S   18:00  0:00  \_ (squid) -f /etc/squid/squid.conf

The moment I run /etc/init.d/squid stop, this appears in the HA log:

tail -f /var/log/ha.log
Squid[7405]: 2013/05/07_18:17:36 INFO: squid:Inconsistent processes: 2063,2068,
Squid[7405]: 2013/05/07_18:17:37 INFO: squid:Inconsistent processes: 2063,2068,
Squid[7405]: 2013/05/07_18:17:38 INFO: squid:Inconsistent processes: 2063,2068,
Squid[7405]: 2013/05/07_18:17:39 INFO: squid:Inconsistent processes: 2063,2068,
Squid[7405]: 2013/05/07_18:17:39 ERROR: squid:Inconsistency of processes remains unsolved
May 07 18:17:39 proxynoc crmd: [1426]: info: process_lrm_event: LRM operation ClusterSquid_monitor_2000 (call=12, rc=1, cib-update=25, confirmed=false) unknown error
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_local_callback: Expanded fail-count-ClusterSquid=value++ to 1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ClusterSquid (1)
May 07 18:17:41 proxynoc lrmd: [1423]: info: cancel_op: operation monitor[12] on ocf::Squid::ClusterSquid for client 1426, its parameters: squid_exe=[/usr/sbin/squid] squid_port=[80] crm_feature_set=[3.0.1] squid_stop_timeout=[30] squid_pidfile=[/var/run/squid.pid] CRM_meta_name=[monitor] squid_conf=[/etc/squid/squid.conf] CRM_meta_timeout=[30000] CRM_meta_interval=[2000] cancelled
May 07 18:17:41 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing key=2:9:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_stop_0 )
May 07 18:17:41 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:13: stop
May 07 18:17:41 proxynoc crmd: [1426]: info: process_lrm_event: LRM operation ClusterSquid_monitor_2000 (call=12, status=1, cib-update=0, confirmed=true) Cancelled
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent update 17: fail-count-ClusterSquid=1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: find_hash_entry: Creating hash entry for last-failure-ClusterSquid
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ClusterSquid (1367961460)
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent update 20: last-failure-ClusterSquid=1367961460
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) ls:
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr)
Squid[7471]: 2013/05/07_18:17:42 INFO: squid:stop_squid:315: stop NORM 1/30
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) ls:
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output: (ClusterSquid:stop:stderr)
May 07 18:17:42 proxynoc crmd: [1426]: info: process_lrm_event: LRM operation ClusterSquid_stop_0 (call=13, rc=0, cib-update=26, confirmed=true) ok
May 07 18:17:43 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing key=8:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_start_0 )
May 07 18:17:43 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:14: start
Squid[7504]: 2013/05/07_18:17:43 INFO: squid:Waiting for squid to be invoked
Squid[7504]: 2013/05/07_18:17:45 INFO: squid:Waiting for squid to be invoked
May 07 18:17:46 proxynoc crmd: [1426]: info: process_lrm_event: LRM operation ClusterSquid_start_0 (call=14, rc=0, cib-update=27, confirmed=true) ok
May 07 18:17:47 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing key=9:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_monitor_2000 )
May 07 18:17:47 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:15: monitor
May 07 18:17:47 proxynoc crmd: [1426]: info: process_lrm_event: LRM operation ClusterSquid_monitor_2000 (call=15, rc=0, cib-update=28, confirmed=false) ok

The following status is shown for only 2 to 4 seconds, and then everything goes back to how it was before:
crm_mon
============
Last updated: Tue May 7 18:19:22 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) - partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ proxyfailovernoc proxynoc ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started proxyfailovernoc
ClusterSquid    (ocf::heartbeat:Squid):         Started proxynoc FAILED

Failed actions:
    ClusterSquid_monitor_2000 (node=proxynoc, call=15, rc=1, status=complete): unknown error

The log starts with some inconsistent PID numbers, followed by standard error output from the RA. Looking at the Squid OCF script
(/usr/lib/ocf/resource.d/heartbeat/Squid), the function get_pids() takes the squid daemon PID from three different sources. If I execute the three searches by hand, they show:

    # Seek by pattern
    SQUID_PIDS[0]= 8205
    # Seek by pidfile
    SQUID_PIDS[1]= 8207
    # Seek by port
    SQUID_PIDS[2]= 8207
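As far as I can tell, the three lookups boil down to the following commands. I'm substituting my own values for the agent's internal variables (PROCESS_PATTERN, SQUID_PIDFILE, SQUID_PORT), so treat the exact pattern string as an assumption:

    # seek by pattern: pgrep -f matches against the full command line
    pgrep -f "/usr/sbin/squid -f /etc/squid/squid.conf"      # -> 8205

    # seek by pidfile: the pidfile holds the PID of the (squid) worker child
    awk '{print $1}' /var/run/squid.pid                      # -> 8207

    # seek by port: netstat reports the worker as the listener on port 80
    netstat -apn | awk '/tcp.*:80 /{sub("\\/.*", "", $7); print $7; exit}'   # -> 8207

Comparing this with the ps output above, the pattern search finds the root-owned master process (/usr/sbin/squid), while the pidfile and port searches both point to the (squid) worker child.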
At first glance, the problem seems to be in the search by pattern. If I change the command stored in the variable, I can correct the value:

#SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")
SQUID_PIDS[0]=$(pgrep -f "\(squid\) -f /etc/squid/squid.conf")

With this change, all three searches find the same PID.

get_pids()
{
    SQUID_PIDS=()
    # Seek by pattern
    SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")

    # Seek by pidfile
    SQUID_PIDS[1]=$(awk '1{print $1}' $SQUID_PIDFILE 2>/dev/null)

    if [[ -n "${SQUID_PIDS[1]}" ]]; then
        typeset exe
        exe=$(ls -l "/proc/${SQUID_PIDS[1]}/exe")
        if [[ $? = 0 ]]; then
            exe=${exe##*-> }
            if ! [[ "$exe" = $SQUID_EXE ]]; then
                SQUID_PIDS[1]=""
            fi
        else
            SQUID_PIDS[1]=""
        fi
    fi

    # Seek by port
    SQUID_PIDS[2]=$(
        netstat -apn |
        awk '/tcp.*[0-9]+\.[0-9]+\.+[0-9]+\.[0-9]+:'$SQUID_PORT' /{
            sub("\\/.*", "", $7); print $7; exit}')
}

But the problem still remains:

Squid[18840]: 2013/05/07_18:55:23 INFO: squid:Inconsistent processes: 8207,8207,
Squid[18840]: 2013/05/07_18:55:24 INFO: squid:Inconsistent processes: 8207,8207,
Squid[18840]: 2013/05/07_18:55:25 INFO: squid:Inconsistent processes: 8207,8207,
Squid[18840]: 2013/05/07_18:55:26 INFO: squid:Inconsistent processes: 8207,8207,

In this case, the first two searches show the same process ID, but the standard error continues to appear. Judging by the trailing comma in the log line, the third value (SQUID_PIDS[2], the search by port) is empty.

I think this shows that monitor_squid() doesn't get all the PIDs from get_pids() at the moment I stop the daemon: if I echo the output of get_pids() inside the while loop, only two of the three PID searches return a value.
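The instrumentation was just an extra log line right after the get_pids call inside the loop (a debugging sketch on my copy of the agent, not part of the shipped script):

    # debug only: dump all three sources on every iteration; during the
    # stop window the port search (SQUID_PIDS[2]) comes back empty
    ocf_log info "$SQUID_NAME:debug pids:" \
        "[${SQUID_PIDS[0]}] [${SQUID_PIDS[1]}] [${SQUID_PIDS[2]}]"

For reference, here is monitor_squid() from the agent: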
monitor_squid()
{
    typeset trialcount=0

    while true; do
        get_pids

        if are_all_pids_found; then
            are_pids_sane
            return $OCF_SUCCESS
        fi

        if is_squid_dead; then
            return $OCF_NOT_RUNNING
        fi

        ocf_log info "$SQUID_NAME:Inconsistent processes:" \
            "${SQUID_PIDS[0]},${SQUID_PIDS[1]},${SQUID_PIDS[2]}"
        (( trialcount = trialcount + 1 ))
        if (( trialcount > SQUID_CONFIRM_TRIALCOUNT )); then
            ocf_log err "$SQUID_NAME:Inconsistency of processes remains unsolved"
            return $OCF_ERR_GENERIC
        fi
        sleep 1
    done
}

I don't know how to correct this issue. Every time this happens, my failure count goes up.
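As a stopgap, I can inspect and clear the accumulated failures by hand with the crm shell (standard commands, if I'm reading the documentation right), but that obviously doesn't address the root cause:

    # show the current failure count for the resource on this node
    crm resource failcount ClusterSquid show proxynoc

    # clear the failed action and reset the count
    crm resource cleanup ClusterSquid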
I can't find a solution. I don't know whether the PIDs should remain in the array at the moment I stop the daemon, but as it stands the agent reports an error.

Tell me if you need more information.

Thanks in advance,

Mauricio Esteban