[ClusterLabs] Cluster fails to start on rebooted nodes without manual fiddling...

Mon Apr 2 16:05:09 EDT 2018

Hi!

I've set up a couple test Pacemaker/Corosync/PCS clusters on virtual machines to manage a PostgreSQL service using the PostgreSQL Automatic Failover (PAF) resource agent available here:  https://clusterlabs.github.io/PAF/

The cluster will be up and running with 3 nodes and PostgreSQL running successfully.  However I've run into a nasty stumbling block when attempting to restart a node or even just issuing `pcs cluster stop` and `pcs cluster start` on a node.

The operating system on the nodes is Ubuntu 16.04.

Firstly, here's my configuration (using `crm configure show`):

------
node 1: d-gp2-dbp62-1 \
        attributes master-postgresql-10-main=1001
node 2: d-gp2-dbp62-2 \
        attributes master-postgresql-10-main=1000
node 3: d-gp2-dbp62-3 \
        attributes master-postgresql-10-main=990
primitive postgresql-10-main pgsqlms \
        params bindir="/usr/lib/postgresql/10/bin" pgdata="/var/lib/postgresql/10/main" pghost="/var/run/postgresql/10/main" pgport=5432 recovery_template="/etc/postgresql/10/main/recovery.conf.pcmk" start_opts="-c config_file=/etc/postgresql/10/main/postgresql.conf" \
        op start interval=0s timeout=60s \
        op stop interval=0s timeout=60s \
        op promote interval=0s timeout=30s \
        op demote interval=0s timeout=120s \
        op monitor interval=15s role=Master timeout=10s \
        op monitor interval=16s role=Slave timeout=10s \
        op notify interval=0s timeout=60s
primitive postgresql-master-vip IPaddr2 \
        params ip=10.124.167.174 cidr_netmask=22 \
        op start interval=0s timeout=20s \
        op stop interval=0s timeout=20s \
        op monitor interval=10s
ms postgresql-ha postgresql-10-main \
        meta notify=true target-role=Master
order demote-postgresql-then-remove-vip Mandatory: postgresql-ha:demote postgresql-master-vip:stop symmetrical=false
colocation postgresql-master-with-vip inf: postgresql-master-vip:Started postgresql-ha:Master
order promote-postgresql-then-add-vip Mandatory: postgresql-ha:promote postgresql-master-vip:start symmetrical=false
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=d-gp2-dbp62 \
        stonith-enabled=false
rsc_defaults rsc_defaults-options: \
        migration-threshold=5 \
        resource-stickiness=10
------

With the resource running on the first node, here's the output of `pcs status`, and everything looks great:

------
Cluster name: d-gp2-dbp62
Last updated: Mon Apr  2 19:43:52 2018          Last change: Mon Apr  2 19:41:35 2018 by root via crm_resource on d-gp2-dbp62-1
Stack: corosync
Current DC: d-gp2-dbp62-3 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 4 resources configured

Online: [ d-gp2-dbp62-1 d-gp2-dbp62-2 d-gp2-dbp62-3 ]

Full list of resources:

 postgresql-master-vip  (ocf::heartbeat:IPaddr2):       Started d-gp2-dbp62-1
 Master/Slave Set: postgresql-ha [postgresql-10-main]
     Masters: [ d-gp2-dbp62-1 ]
     Slaves: [ d-gp2-dbp62-2 d-gp2-dbp62-3 ]
------

Now, if I restart the second node, and execute `pcs cluster start` once it's back up, it fails to start the resource and shows me this in the `pcs status` output:

------
* postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11, status=complete, exitreason='Instance "postgresql-10-main" failed to start (rc: 1)',
    last-rc-change='Mon Apr  2 19:50:40 2018', queued=0ms, exec=228ms
------

Looking in corosync.log, I see the following:

------
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_execute: executing - rsc:postgresql-10-main action:start call_id:11
pgsqlms(postgresql-10-main)[2092]:      2018/04/02_19:50:40  ERROR: Instance "postgresql-10-main" failed to start (rc: 1)
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ pg_ctl: could not start server ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ Examine the log output. ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:   notice: operation_finished:  postgresql-10-main_start_0:2092:stderr [ ocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1) ]
Apr 02 19:50:40 [2014] d-gp2-dbp62-2       lrmd:     info: log_finished:        finished - rsc:postgresql-10-main action:start call_id:11 pid:2092 exit-code:1 exec-time:228ms queue-time:0ms
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:     info: action_synced_wait:  Managed pgsqlms_meta-data_0 process 2115 exited with rc=0
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:   Operation postgresql-10-main_start_0: unknown error (node=d-gp2-dbp62-2, call=11, rc=1, cib-update=15, confirmed=true)
Apr 02 19:50:40 [2017] d-gp2-dbp62-2       crmd:   notice: process_lrm_event:   d-gp2-dbp62-2-postgresql-10-main_start_0:11 [ pg_ctl: could not start server\nExamine the log output.\nocf-exit-reason:Instance "postgresql-10-main" failed to start (rc: 1)\n ]
------

It tells me to examine the PostgreSQL log output, so I look there, but I don't see anything logged at all since the server shutdown.  There are no new log entries indicating that a start was attempted.  I can repeat the `pcs cluster start/stop` commands any number of times, and the problem remains.

Now for the weird part, which is the workaround I inadvertently discovered when trying to figure out what was going wrong.  First, I do a `pcs cluster stop` again.  Then I start up the service using systemd, and it starts up just fine; so I do that and then shut it back down using `service postgresql at 10-main start` and `service postgresql at 10-main stop`, which should put it back in the original state.  Now, when I issue `pcs cluster start`, everything comes up just fine as expected with no errors!?!

I have also seen this "unknown error" come up at other undesirable times, like when doing a manual failover using `pcs cluster stop` on the primary or a `pcs resource move --master ...` command, however once the workaround is applied to the node having issues, it works perfectly fine until it's rebooted.

Can anyone explain what is happening here, and how I can fix it properly?

Best wishes,
-- 
Casey