[Pacemaker] Issues with ocf_run on CentOS7 with pgsql resource agent

Brendan Reekie breekie at sandvine.com
Fri Nov 21 15:00:02 UTC 2014


-----Original Message-----
From: Andrei Borzenkov [mailto:arvidjaar at gmail.com]
Sent: Friday, November 21, 2014 7:13 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] Issues with ocf_run on CentOS7 with pgsql resource agent

On Fri, Nov 21, 2014 at 12:36 AM, Brendan Reekie <breekie at sandvine.com> wrote:
>> I’m running into an issues with the pgsql resource agent running on CentOS7.
>> The issue is when the pgsql resource agent attempts a call to 
>> runasowner it uses a su as user postgres in a call to ocf_run, this is 
>> causing what appears to be a new session on CentOS7.
>>
>>
>>
>> We’ve narrowed it down to the ocf_run method in ocf-shellfuncs in 
>> executing the command:
>>
>>        output=`"$@" 2>&1`
>>
>>
>>
>> Snippet of /var/log/messages:
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: LogActions: Stop
>> pgsql:0  (pcmk-1)
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pengine[2041]: notice: process_pe_message:
>> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
>>
>> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: 
>> Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
>>
>> Nov 20 21:29:05 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: 
>> Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
>>
>> Nov 20 21:29:05 ip-10-0-0-4 pacemakerd[2412]: notice: crm_add_logfile:
>> Additional logging available in /var/log/pacemaker.log
>>
>> Nov 20 21:29:05 ip-10-0-0-4 su: (to postgres) root on none
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Created slice user-26.slice.
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Starting Session c4 of user postgres.
>>
>> Nov 20 21:29:05 ip-10-0-0-4 systemd: Started Session c4 of user postgres.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: process_lrm_event: LRM 
>> operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) 
>> ok
>>
>> Nov 20 21:29:30 ip-10-0-0-4 crmd[2042]: notice: te_rsc_command: 
>> Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
>>
>> Nov 20 21:29:30 ip-10-0-0-4 pacemakerd[2464]: notice: crm_add_logfile:
>> Additional logging available in /var/log/pacemaker.log
>>
>> Nov 20 21:29:30 ip-10-0-0-4 su: (to postgres) root on none
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Created slice user-26.slice.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Starting Session c5 of user postgres.
>>
>> Nov 20 21:29:30 ip-10-0-0-4 systemd: Started Session c5 of user postgres.
>>
>>
>>
>> Software versions:
>>
>>                 pacemaker 1.1.10
>>
>>                 resource-agents 3.9.5
>>
>>                 corosync 2.3.3-2
>>
>>
>>
>> Has anyone else experienced this issue or know of what might be going 
>> wrong to cause new sessions to be created?
>>

>Session is created by pam_systemd. What exact problem does it cause?
>So far you did not mention any.

Sorry Andrei on not clarifying.  The issue that we observed is that the pgsql resource agent will eventually timeout (if the defaults are left at 60 seconds).  Increasing the timeout to several minutes the master/slave are correctly identified and started correctly.  

In the above snippet from /var/log/messages it's taking approximately 25 seconds to execute when su to postgres.

It seems any call to the pgsql resource agent that invokes the runasowner encounters a 25 second delay.  For the corresponding time range the /var/log/pacemaker.log file shows that the notify call 17 started at 21:29:05 but didn't complete to 21:29:30.

Relevant logs:
	Nov 20 21:29:05 [2039] pcmk-1       lrmd:     info: log_execute:        executing - rsc:pgsql action:notify call_id:17
	Nov 20 21:29:30 [2039] pcmk-1       lrmd:     info: log_finished:       finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms

Full log during that time:
Nov 20 21:29:05 [2039] pcmk-1       lrmd:     info: log_finished:       finished - rsc:pgsql action:notify call_id:16 pid:2320 exit-code:0 exec-time:25128ms queue-time:0ms
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: match_graph_event:  Action pgsql_notify_0 (50) confirmed on pcmk-1 (rc=0)
Nov 20 21:29:05 [2042] pcmk-1       crmd:   notice: process_lrm_event:  LRM operation pgsql_notify_0 (call=16, rc=0, cib-update=0, confirmed=true) ok
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: match_graph_event:  Action pgsql_notify_0 (51) confirmed on pcmk-2 (rc=0)
Nov 20 21:29:05 [2042] pcmk-1       crmd:   notice: run_graph:  Transition 0 (Complete=19, Pending=0, Fired=0, Skipped=2, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-204.bz2): Stopped
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: do_state_transition:        State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Nov 20 21:29:05 [2037] pcmk-1        cib:     info: cib_process_request:        Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/30, version=0.291.20)
Nov 20 21:29:05 [2041] pcmk-1    pengine:   notice: unpack_config:      On loss of CCM Quorum: Ignore
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: determine_online_status:    Node pcmk-1 is online
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: determine_online_status:    Node pcmk-2 is online
Nov 20 21:29:05 [2041] pcmk-1    pengine:  warning: unpack_rsc_op:      Processing failed op start for pgsql:0 on pcmk-1: unknown error (1)
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: group_print:         Resource Group: master-group
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_print:            vip-master (ocf::heartbeat:IPaddr2):       Stopped
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_print:            vip-rep    (ocf::heartbeat:IPaddr2):       Stopped
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: clone_print:         Master/Slave Set: msPostgresql [pgsql]
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_print:            pgsql      (ocf::heartbeat:pgsql): FAILED pcmk-1
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: short_print:             Slaves: [ pcmk-2 ]
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: get_failcount_full:         pgsql:0 has failed INFINITY times on pcmk-1
Nov 20 21:29:05 [2041] pcmk-1    pengine:  warning: common_apply_stickiness:    Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: get_failcount_full:         pgsql:1 has failed INFINITY times on pcmk-1
Nov 20 21:29:05 [2041] pcmk-1    pengine:  warning: common_apply_stickiness:    Forcing msPostgresql away from pcmk-1 after 1000000 failures (max=1)
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_color:       Resource pgsql:0 cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: master_color:       msPostgresql: Promoted 0 instances of a possible 1 to master
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: rsc_merge_weights:  vip-master: Rolling back scores from vip-rep
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_color:       Resource vip-master cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: native_color:       Resource vip-rep cannot run anywhere
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: RecurringOp:         Start recurring monitor (4s) for pgsql:1 on pcmk-2
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: RecurringOp:         Start recurring monitor (4s) for pgsql:1 on pcmk-2
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: LogActions:         Leave   vip-master      (Stopped)
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: LogActions:         Leave   vip-rep (Stopped)
Nov 20 21:29:05 [2041] pcmk-1    pengine:   notice: LogActions:         Stop    pgsql:0 (pcmk-1)
Nov 20 21:29:05 [2041] pcmk-1    pengine:     info: LogActions:         Leave   pgsql:1 (Slave pcmk-2)
Nov 20 21:29:05 [2041] pcmk-1    pengine:   notice: process_pe_message:         Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-205.bz2
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: do_state_transition:        State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: do_te_invoke:       Processing graph 1 (ref=pe_calc-dc-1416518945-23) derived from /var/lib/pacemaker/pengine/pe-input-205.bz2
Nov 20 21:29:05 [2042] pcmk-1       crmd:   notice: te_rsc_command:     Initiating action 47: notify pgsql_pre_notify_stop_0 on pcmk-1 (local)
Nov 20 21:29:05 [2042] pcmk-1       crmd:     info: do_lrm_rsc_op:      Performing key=47:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_notify_0
Nov 20 21:29:05 [2039] pcmk-1       lrmd:     info: log_execute:        executing - rsc:pgsql action:notify call_id:17
Nov 20 21:29:05 [2042] pcmk-1       crmd:   notice: te_rsc_command:     Initiating action 48: notify pgsql_pre_notify_stop_0 on pcmk-2
Nov 20 21:29:05 [2412] pcmk-1 pacemakerd:     info: crm_log_init:       Changed active directory to /var/lib/pacemaker/cores/root
Nov 20 21:29:05 [2412] pcmk-1 pacemakerd:     info: crm_xml_cleanup:    Cleaning up memory from libxml2
Nov 20 21:29:30 [2042] pcmk-1       crmd:     info: match_graph_event:  Action pgsql_notify_0 (48) confirmed on pcmk-2 (rc=0)
Nov 20 21:29:30 [2039] pcmk-1       lrmd:     info: log_finished:       finished - rsc:pgsql action:notify call_id:17 pid:2401 exit-code:0 exec-time:25130ms queue-time:1ms
Nov 20 21:29:30 [2042] pcmk-1       crmd:     info: match_graph_event:  Action pgsql_notify_0 (47) confirmed on pcmk-1 (rc=0)
Nov 20 21:29:30 [2042] pcmk-1       crmd:   notice: process_lrm_event:  LRM operation pgsql_notify_0 (call=17, rc=0, cib-update=0, confirmed=true) ok
Nov 20 21:29:30 [2042] pcmk-1       crmd:   notice: te_rsc_command:     Initiating action 1: stop pgsql_stop_0 on pcmk-1 (local)
Nov 20 21:29:30 [2042] pcmk-1       crmd:     info: do_lrm_rsc_op:      Performing key=1:1:0:804110e4-5de5-4a3a-9fd7-2fb96177d7ac op=pgsql_stop_0
Nov 20 21:29:30 [2039] pcmk-1       lrmd:     info: log_execute:        executing - rsc:pgsql action:stop call_id:18
Nov 20 21:29:30 [2464] pcmk-1 pacemakerd:     info: crm_log_init:       Changed active directory to /var/lib/pacemaker/cores/root
Nov 20 21:29:30 [2464] pcmk-1 pacemakerd:     info: crm_xml_cleanup:    Cleaning up memory from libxml2 

 
The other observation is that a call from a test script (ocftest) that calls the ocf_run with the same parameters as the pgsql resource agent doesn't encounter the same delays.  The following is the test script that I've used:

#!/bin/sh

: ${OCF_FUNCTIONS_DIR=${OCF_ROOT}/lib/heartbeat}
. ${OCF_FUNCTIONS_DIR}/ocf-shellfuncs


OCF_RESKEY_pgdba="postgres"
OCF_RESKEY_pgdata="/var/lib/pgsql/9.3/data/"

#
#   Run the given command in the Resource owner environment...
#
runasowner() {
    local quietrun=""
    local loglevel="-err"
    local var

    for var in 1 2
    do
        case "$1" in
            "-q")
                quietrun="-q"
                shift 1;;
            "warn"|"err")
                loglevel="-$1"
                shift 1;;
            *)
                ;;
        esac
    done

    ocf_log info "&&& su called runasowner: $*"
    ocf_run $quietrun $loglevel su $OCF_RESKEY_pgdba -c "cd $OCF_RESKEY_pgdata; $*"
}

runasowner "test -w $OCF_RESKEY_pgdata";


When executing the script multiple times, the 25 second delay is not observed and nor is what appear to be a new session being created.  Output of /var/log/messages:
Nov 21 14:54:23 ip-10-0-0-4 su: (to postgres) centos on none
Nov 21 14:54:24 ip-10-0-0-4 su: (to postgres) centos on none
Nov 21 14:54:25 ip-10-0-0-4 su: (to postgres) centos on none

Stdout capture:
[root at pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
[root at pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/
[root at pcmk-1 heartbeat]# ./ocftest
INFO: &&& su called runasowner: test -w /var/lib/pgsql/9.3/data/

Any ideas on what would be causing the delays?


More information about the Pacemaker mailing list