[Pacemaker] Squid ocf agent. Temporaly unknown error stopping squid daemon.

Wed May 8 00:56:26 CEST 2013

Hello,

I'm working on a ACTIVE/PASSIVE proxy cluster on debian 6.0.7. The cluster
has two resources configured, an OCF IPaddr2 agent and a squid OCF agent.
 The IPaddr2 agent is working fine, but the squid agent shows a temporary
error when I manually stop the service (/etc/init.d/squid stop) in the node
with the

running resource.

 If I stop the resource, restart the server o put on standby the node
running the squid agent, the service stops fine and passes it to the other
node.
The only issue appears during the stop of the service.

The servers (debian 6.0.7) kernel and packages version:

uname -r
2.6.32-5-amd64

pacemaker    1.0.9.1+hg15626-1
heartbeat    1:3.0.3-2
libheartbeat2    1:3.0.3-2

Pacemaker configuration:

crm configure show
node $id="433a22e8-9620-4889-b407-47125a40d4ae" proxyfailovernoc \
        attributes standby="off"

node $id="dfb4a72b-24bb-4410-809d-514952f68e76" proxynoc \
        attributes standby="off"

primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params ip="10.5.15.42" cidr_netmask="32" nic="eth0" \
        op monitor interval="10s"

primitive ClusterSquid ocf:heartbeat:Squid \
        params squid_exe="/usr/sbin/squid"
squid_conf="/etc/squid/squid.conf" squid_pidfile="/var/run/squid.pid"
squid_port="80" squid_stop_timeout="30" \
        op start interval="0" timeout="60s" \
        op stop interval="0" timeout="120s" \
        op monitor interval="2s" timeout="30s" \
        meta target-role="Started" failure-timeout="30s"

property $id="cib-bootstrap-options" \
        dc-version="1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        last-lrm-refresh="1367960908" \
        expected-quorum-votes="2"

This is my pacemaker status before stopping squid:

crm_mon -1ro
============
Last updated: Tue May  7 18:08:36 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) -
partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ proxyfailovernoc proxynoc ]

Full list of resources:

 ClusterIP      (ocf::heartbeat:IPaddr2):       Started proxyfailovernoc
 ClusterSquid   (ocf::heartbeat:Squid): Started proxynoc

Operations:
* Node proxynoc:
   ClusterIP: migration-threshold=1000000
    + (6) start: rc=0 (ok)
    + (7) monitor: interval=10000ms rc=0 (ok)
    + (9) stop: rc=0 (ok)
   ClusterSquid: migration-threshold=1000000
    + (12) monitor: interval=2000ms rc=0 (ok)
* Node proxyfailovernoc:
   ClusterIP: migration-threshold=1000000
    + (10) stop: rc=0 (ok)
    + (11) start: rc=0 (ok)
    + (12) monitor: interval=10000ms rc=0 (ok)

crm_verify -LVVV
crm_verify[5411]: 2013/05/07_18:10:43 info: main: =#=#=#=#= Getting XML
=#=#=#=#=
crm_verify[5411]: 2013/05/07_18:10:43 info: main: Reading XML from: live
cluster
crm_verify[5411]: 2013/05/07_18:10:43 info: unpack_config: Node scores:
'red' = -INFINITY, 'yellow' = 0, 'green' = 0
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node
proxynoc is online
crm_verify[5411]: 2013/05/07_18:10:43 info: determine_online_status: Node
proxyfailovernoc is online
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print: ClusterIP
(ocf::heartbeat:IPaddr2):       Started proxyfailovernoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: native_print:
ClusterSquid        (ocf::heartbeat:Squid): Started proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: RecurringOp:  Start recurring
monitor (2s) for ClusterSquid on proxynoc
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource
ClusterIP      (Started proxyfailovernoc)
crm_verify[5411]: 2013/05/07_18:10:43 notice: LogActions: Leave resource
ClusterSquid   (Started proxynoc)

Squid status before stopping daemon

netstat -anpt
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address
State       PID/Program name
tcp        0      0 0.0.0.0:80              0.0.0.0:*
LISTEN      2068/(squid)

cat /var/run/squid.pid
2068

ps auxf | grep squid
root      2063  0.0  0.0  21832   660 ?        Ss   18:00   0:00
/usr/sbin/squid -f /etc/squid/squid.conf
proxy     2068  0.0  0.1  29864  6408 ?        S    18:00   0:00  \_
(squid) -f /etc/squid/squid.conf

At the moment I run /etc/init.d/squid stop, appears in the HA log:

tail -f /var/log/ha.log
Squid[7405]:    2013/05/07_18:17:36 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]:    2013/05/07_18:17:37 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]:    2013/05/07_18:17:38 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]:    2013/05/07_18:17:39 INFO: squid:Inconsistent processes:
2063,2068,
Squid[7405]:    2013/05/07_18:17:39 ERROR: squid:Inconsistency of processes
remains unsolved
May 07 18:17:39 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=12, rc=1, cib-update=25,
confirmed=false)

unknon error
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update
relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_local_callback:
Expanded fail-count-ClusterSquid=value++ to 1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending
flush op to all hosts for: fail-count-ClusterSquid (1)
May 07 18:17:41 proxynoc lrmd: [1423]: info: cancel_op: operation
monitor[12] on ocf::Squid::ClusterSquid for client 1426, its parameters:

squid_exe=[/usr/sbn/squid] squid_port=[80] crm_feature_set=[3.0.1]
squid_stop_timeout=[30] squid_pidfile=[/var/run/squid.pid]
CRM_meta_name=[monitor]

squid_conf=[/etc/squid/sqid.conf] CRM_meta_timeout=[30000]
CRM_meta_interval=[2000]  cancelled
May 07 18:17:41 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=2:9:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_stop_0 )
May 07 18:17:41 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:13: stop
May 07 18:17:41 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=12, status=1, cib-update=0,
confirmed=true)

Canelled
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent
update 17: fail-count-ClusterSquid=1
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_ha_callback: Update
relayed from proxyfailovernoc
May 07 18:17:41 proxynoc attrd: [1425]: info: find_hash_entry: Creating
hash entry for last-failure-ClusterSquid
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_trigger_update: Sending
flush op to all hosts for: last-failure-ClusterSquid (1367961460)
May 07 18:17:41 proxynoc attrd: [1425]: info: attrd_perform_update: Sent
update 20: last-failure-ClusterSquid=1367961460
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) ls:
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:41 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr)

Squid[7471]:    2013/05/07_18:17:42 INFO: squid:stop_squid:315:  stop NORM
1/30
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) ls:
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) cannot access /proc/2068/exe
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr) : No such file or directory
May 07 18:17:42 proxynoc lrmd: [1423]: info: RA output:
(ClusterSquid:stop:stderr)

May 07 18:17:42 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_stop_0 (call=13, rc=0, cib-update=26,
confirmed=true) ok
May 07 18:17:43 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=8:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c op=ClusterSquid_start_0 )
May 07 18:17:43 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:14: start
Squid[7504]:    2013/05/07_18:17:43 INFO: squid:Waiting for squid to be
invoked
Squid[7504]:    2013/05/07_18:17:45 INFO: squid:Waiting for squid to be
invoked
May 07 18:17:46 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_start_0 (call=14, rc=0, cib-update=27,
confirmed=true) ok
May 07 18:17:47 proxynoc crmd: [1426]: info: do_lrm_rsc_op: Performing
key=9:10:0:cb9946fc-164c-495f-a75b-bbaeca933a6c
op=ClusterSquid_monitor_2000 )
May 07 18:17:47 proxynoc lrmd: [1423]: info: rsc:ClusterSquid:15: monitor
May 07 18:17:47 proxynoc crmd: [1426]: info: process_lrm_event: LRM
operation ClusterSquid_monitor_2000 (call=15, rc=0, cib-update=28,
confirmed=false) ok

This status is only shown between 2 to 4 seconds and everything becomes as
before:

crm_mon
============
Last updated: Tue May  7 18:19:22 2013
Stack: Heartbeat
Current DC: proxyfailovernoc (433a22e8-9620-4889-b407-47125a40d4ae) -
partition with quorum
Version: 1.0.9-74392a28b7f31d7ddc86689598bd23114f58978b
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ proxyfailovernoc proxynoc ]

ClusterIP       (ocf::heartbeat:IPaddr2):       Started proxyfailovernoc
ClusterSquid    (ocf::heartbeat:Squid): Started proxynoc FAILED

Failed actions:
    ClusterSquid_monitor_2000 (node=proxynoc, call=15, rc=1,
status=complete): unknown error

The logs shows at the beginning some inconsistent pid numbers, and a
standard error thrown by de RA output. Searching in the squid ocf script

(/usr/lib/ocf/resource.d/heartbeat/Squid), the function get_pids() take the
squid daemon pid from different sources.
 If I execute the three commands it shows:

 # Seek by pattern
 SQUID_PIDS[0]= 8205
 # Seek by pidfile
 SQUID_PIDS[1]= 8207
 # Seek by port
 SQUID_PIDS[2]= 8207

At first glance, the problem seems to be at the search through the pattern.
If i change the command store in the variable, I can correct the value:

#SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")
SQUID_PIDS[0]=$(pgrep -f "\(squid\) -f /etc/squid/squid.conf")

with this change all searches find the same ips

get_pids()
{
        SQUID_PIDS=()

        # Seek by pattern
        SQUID_PIDS[0]=$(pgrep -f "$PROCESS_PATTERN")

        # Seek by pidfile
        SQUID_PIDS[1]=$(awk '1{print $1}' $SQUID_PIDFILE 2>/dev/null)

        if [[ -n "${SQUID_PIDS[1]}" ]]; then
                typeset exe
                exe=$(ls -l "/proc/${SQUID_PIDS[1]}/exe")
                if [[ $? = 0 ]]; then
                        exe=${exe##*-> }
                        if ! [[ "$exe" = $SQUID_EXE ]]; then
                                SQUID_PIDS[1]=""
                        fi
                else
                        SQUID_PIDS[1]=""
                fi
        fi

        # Seek by port
        SQUID_PIDS[2]=$(
                netstat -apn |
                awk '/tcp.*[0-9]+\.[0-9]+\.+[0-9]+\.[0-9]+:'$SQUID_PORT' /{
                        sub("\\/.*", "", $7); print $7; exit}')
}

But the problem remains still:

Squid[18840]:   2013/05/07_18:55:23 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]:   2013/05/07_18:55:24 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]:   2013/05/07_18:55:25 INFO: squid:Inconsistent processes:
8207,8207,
Squid[18840]:   2013/05/07_18:55:26 INFO: squid:Inconsistent processes:
8207,8207,

In this case, it shows the same process id. The standard error continues to
appear.

I think this shows that the process monitor_squid() doesn't save all the
pids from get_pids() at the moment I stop the daemon. If echoed the output
the get_pid function inside the while loop, it only shows two of the three
pids searches.

monitor_squid()
{
        typeset trialcount=0

        while true; do
                get_pids

                if are_all_pids_found; then
                        are_pids_sane
                        return $OCF_SUCCESS
                fi

                if is_squid_dead; then
                        return $OCF_NOT_RUNNING
                fi

                ocf_log info "$SQUID_NAME:Inconsistent processes:" \
                        "${SQUID_PIDS[0]},${SQUID_PIDS[1]},${SQUID_PIDS[2]}"
                (( trialcount = trialcount + 1 ))
                if (( trialcount > SQUID_CONFIRM_TRIALCOUNT )); then
                        ocf_log err "$SQUID_NAME:Inconsistency of processes
remains unsolved"
                        return $OCF_ERR_GENERIC
                fi
                sleep 1
        done
}

I don't know how to correct this issue. Every time this process happen my
failure count goes up.
 I cant find a solution, I don't know if the process should stay in the
array value at the moment I stop the daemon, but it shows an error.

 Tell me if you need more information.

Thanks in advance

Mauricio Esteban.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20130507/b3ee6393/attachment-0001.html>