[ClusterLabs] ocf:heartbeat:pgsql not starting
Darren Kinley
dkinley at mdacorporation.com
Thu Aug 11 21:44:03 UTC 2016
Hi,
I have PostgreSQL 9.3 replication set up and I'm trying to put it under Pacemaker control
using the ocf:heartbeat:pgsql resource agent provided by SLES 12 SP1.
This is the crmsh script that I used to configure Pacemaker:
configure cib new pgsql_cfg --force
configure primitive res-ars-pgsql ocf:heartbeat:pgsql \
pgctl="/usr/lib/postgresql93/bin/pg_ctl" \
psql="/usr/lib/postgresql93/bin/psql" \
pgdata="/var/lib/pgsql/data/" \
rep_mode="sync" \
node_list="ars1 ars2" \
restore_command="cp /var/lib/pgsql/pg_archive/%f %p" \
primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" \
master_ip="192.168.244.223" \
restart_on_promote='true' \
pghost="191.168.244.223" \
repuser="postgres" \
check_wal_receiver='true' \
monitor_user='postgres' \
monitor_password='xxx' \
op start timeout="120s" interval="0s" on-fail="restart" \
op monitor timeout="120s" interval="4s" on-fail="restart" \
op monitor timeout="120s" interval="3s" on-fail="restart" role="Master" \
op promote timeout="120s" interval="0s" on-fail="restart" \
op demote timeout="120s" interval="0s" on-fail="stop" \
op stop timeout="120s" interval="0s" on-fail="block" \
op notify timeout="90s" interval="0s"
configure ms ms-ars-pgsql res-ars-pgsql \
meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
configure colocation col-ars-pgsql-with-drbd inf: ms-ars-pgsql:Master ms-ars-drbd:Master
configure cib commit pgsql_cfg
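For completeness, this is a sketch of how I can sanity-check what actually landed in the CIB after the commit (the crm_resource --get-parameter call is an assumption on my part about how to read a single parameter back):

# show the committed primitive and master/slave definitions
crm configure show res-ars-pgsql ms-ars-pgsql

# read one parameter back, to confirm the agent won't fall back to its
# built-in /usr/bin/pg_ctl default
crm_resource --resource res-ars-pgsql --get-parameter pgctl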
I have a ~postgres/.pgpass file in place.
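For reference, the entry format is hostname:port:database:username:password, and the file has to be mode 0600 for libpq to use it. A sketch of what I expect the replication entry to look like (the port 5432 and the password here are placeholders, not necessarily my real values):

# ~postgres/.pgpass (chmod 0600)
# hostname:port:database:username:password
192.168.244.223:5432:replication:postgres:xxx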
The pgsql resources remain stopped on both nodes. Only once during the 12 hours I've been working on this
did both nodes try to bring up PostgreSQL (both in recovery mode) before Pacemaker shut them both down again.
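For context, this is the one-shot status command I've been watching while testing (standard crm_mon flags, as far as I know):

# one-shot cluster status, including inactive resources and fail counts
crm_mon -1 -r -f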
When running ocf-tester, I believe I'm supposed to name the master/slave resource:
ars2:/usr/lib/ocf/resource.d/heartbeat # ocf-tester -v -n ms-ars-pgsql `pwd`/pgsql
Beginning tests for /usr/lib/ocf/resource.d/heartbeat/pgsql...
Testing permissions with uid nobody
Testing: meta-data
Testing: meta-data
...
<XML removed/>
...
Testing: validate-all
Checking current state
Testing: stop
INFO: waiting for server to shut down.... done server stopped
INFO: PostgreSQL is down
Testing: monitor
INFO: PostgreSQL is down
Testing: monitor
ocf-exit-reason:Setup problem: couldn't find command: /usr/bin/pg_ctl
Testing: start
INFO: server starting
INFO: PostgreSQL start command sent.
INFO: PostgreSQL is started.
Testing: monitor
Testing: monitor
INFO: Don't check /var/lib/pgsql/data during probe
Testing: notify
Checking for demote action
ocf-exit-reason:Not in a replication mode.
Checking for promote action
ocf-exit-reason:Not in a replication mode.
Testing: demotion of started resource
ocf-exit-reason:Not in a replication mode.
* rc=6: Demoting a start resource should not fail
Testing: promote
ocf-exit-reason:Not in a replication mode.
* rc=6: Promote failed
Testing: demote
ocf-exit-reason:Not in a replication mode.
* rc=6: Demote failed
Aborting tests
The 'Not in a replication mode' errors disagree with the rep_mode="sync" setting on res-ars-pgsql above, and the complaint about /usr/bin/pg_ctl ignores the pgctl path I configured.
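I suspect ocf-tester does not read parameters from the CIB at all, which would explain both the /usr/bin/pg_ctl default and the replication-mode complaint. A sketch of how I think the same test could be run with the parameters passed explicitly (assuming ocf-tester's -o name=value option behaves as I expect):

ocf-tester -v -n ms-ars-pgsql \
  -o pgctl="/usr/lib/postgresql93/bin/pg_ctl" \
  -o psql="/usr/lib/postgresql93/bin/psql" \
  -o pgdata="/var/lib/pgsql/data/" \
  -o rep_mode="sync" \
  -o node_list="ars1 ars2" \
  -o master_ip="192.168.244.223" \
  -o repuser="postgres" \
  /usr/lib/ocf/resource.d/heartbeat/pgsql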
I'm not sure whether the pacemaker.log entries for these CIB changes are needed, but here is the relevant excerpt:
Aug 11 09:19:53 [2757] ars2 pengine: info: clone_print: Master/Slave Set: ms-ars-pgsql [res-ars-pgsql]
Aug 11 09:19:53 [2757] ars2 pengine: info: short_print: Stopped: [ ars1 ars2 ]
Aug 11 09:19:53 [2757] ars2 pengine: info: get_failcount_full: res-ars-pgsql:0 has failed INFINITY times on ars1
Aug 11 09:19:53 [2757] ars2 pengine: warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
Aug 11 09:19:53 [2757] ars2 pengine: info: get_failcount_full: ms-ars-pgsql has failed INFINITY times on ars1
Aug 11 09:19:53 [2757] ars2 pengine: warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
Aug 11 09:19:53 [2757] ars2 pengine: info: get_failcount_full: res-ars-pgsql:0 has failed INFINITY times on ars2
Aug 11 09:19:53 [2757] ars2 pengine: warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
Aug 11 09:19:53 [2757] ars2 pengine: info: get_failcount_full: ms-ars-pgsql has failed INFINITY times on ars2
Aug 11 09:19:53 [2757] ars2 pengine: warning: common_apply_stickiness: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
Aug 11 09:19:53 [2757] ars2 pengine: info: rsc_merge_weights: ms-ars-drbd: Rolling back scores from ms-ars-pgsql
Aug 11 09:19:53 [2757] ars2 pengine: info: master_color: Promoting res-ars-drbd:1 (Master ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: master_color: ms-ars-drbd: Promoted 1 instances of a possible 1 to master
Aug 11 09:19:53 [2757] ars2 pengine: info: native_color: res-ars-pgsql:0: Rolling back scores from ms-ars-drbd
Aug 11 09:19:53 [2757] ars2 pengine: info: native_color: Resource res-ars-pgsql:0 cannot run anywhere
Aug 11 09:19:53 [2757] ars2 pengine: info: native_color: res-ars-pgsql:1: Rolling back scores from ms-ars-drbd
Aug 11 09:19:53 [2757] ars2 pengine: info: native_color: Resource res-ars-pgsql:1 cannot run anywhere
Aug 11 09:19:53 [2757] ars2 pengine: info: master_color: ms-ars-pgsql: Promoted 0 instances of a possible 1 to master
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-mgmt-vip (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-mgmt-app (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-vip (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-drbd:0 (Slave ars1)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-drbd:1 (Master ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-lvm (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-fs_dropbox (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-fs_svndata (Started ars2)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-pgsql:0 (Stopped)
Aug 11 09:19:53 [2757] ars2 pengine: info: LogActions: Leave res-ars-pgsql:1 (Stopped)
Aug 11 09:19:53 [2758] ars2 crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Aug 11 09:19:53 [2758] ars2 crmd: notice: do_te_invoke: Processing graph 222 (ref=pe_calc-dc-1470932393-1349) derived from /var/lib/pacemaker/pengine/pe-input-625.bz2
and here is the corresponding excerpt from /var/log/messages:
2016-08-11T09:19:53.146603-07:00 ars-2 crmd[2758]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
2016-08-11T09:19:53.152322-07:00 ars-2 pengine[2757]: notice: On loss of CCM Quorum: Ignore
2016-08-11T09:19:53.153078-07:00 ars-2 pengine[2757]: warning: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
2016-08-11T09:19:53.153266-07:00 ars-2 pengine[2757]: warning: Forcing ms-ars-pgsql away from ars1 after 1000000 failures (max=1000000)
2016-08-11T09:19:53.153395-07:00 ars-2 pengine[2757]: warning: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
2016-08-11T09:19:53.153547-07:00 ars-2 pengine[2757]: warning: Forcing ms-ars-pgsql away from ars2 after 1000000 failures (max=1000000)
2016-08-11T09:19:53.155568-07:00 ars-2 crmd[2758]: notice: Processing graph 222 (ref=pe_calc-dc-1470932393-1349) derived from /var/lib/pacemaker/pengine/pe-input-625.bz2
2016-08-11T09:19:53.155768-07:00 ars-2 pengine[2757]: notice: Calculated Transition 222: /var/lib/pacemaker/pengine/pe-input-625.bz2
2016-08-11T09:19:53.155927-07:00 ars-2 crmd[2758]: notice: Transition 222 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-625.bz2): Complete
2016-08-11T09:19:53.156085-07:00 ars-2 crmd[2758]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
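The INFINITY fail counts above presumably explain why the resources "cannot run anywhere". A minimal sketch of how I understand the fail counts can be inspected and cleared with crmsh before another attempt (assuming the failcount/cleanup subcommands work the way I expect):

# show the current fail count for the pgsql resource on each node
crm resource failcount res-ars-pgsql show ars1
crm resource failcount res-ars-pgsql show ars2

# clear the failures so the policy engine will try to place the resource again
crm resource cleanup ms-ars-pgsql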
Can anyone provide thoughts on how to debug this?
Should I give up on the SLES-provided RA and use PAF instead?
Thanks,
Darren