[ClusterLabs] Cluster fails to start on rebooted nodes without manual fiddling...

Tue Apr 3 08:58:29 EDT 2018

On Mon, 02 Apr 2018 14:05:09 -0600
Casey & Gina <caseyandgina at icloud.com> wrote:
[...]
> Now, if I restart the second node, and execute `pcs cluster start` once it's
> back up, it fails to start the resource and shows me this in the `pcs status`
> output:
> 
> 
> ------
> * postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11,
> status=complete, exitreason='Instance "postgresql-10-main" failed to start
> (rc: 1)', last-rc-change='Mon Apr  2 19:50:40 2018', queued=0ms, exec=228ms
> ------

When the cluster was down on "d-gp2-dbp62-2", was PostgreSQL stopped as well on
this node?

[...]
> It tells me to examine the PostgreSQL log output, so I look there, but I
> don't see anything logged at all since the server shutdown.

I suppose that is because either:

* the error comes from pg_ctl, before it was able to actually start PostgreSQL
* or the error was raised before PostgreSQL was able to set up its logging

In both cases, something is failing at a very early stage.

One thing comes to mind: did you set up "systemd-tmpfiles" as explained at the
end of the following chapter?

https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-pcs.html#postgresql-and-cluster-stack-installation
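
A minimal sketch of how to check this on the rebooted node. It assumes Debian's
default layout, where stat_temp_directory lives under /var/run/postgresql, which
is wiped on reboot. The version (10), the cluster name (main), the directory
path and the tmpfiles.d file name mentioned afterwards are assumptions; adjust
them to your setup and to what the chapter above says.

```shell
#!/bin/sh
# ASSUMPTION: PostgreSQL 10, cluster "main", Debian default stat_temp_directory.
PGVER=10
PGCLUSTER=main

# Build the tmpfiles.d entry for the stats temp directory of a given
# version ($1) and cluster name ($2).
tmpfiles_entry() {
    printf 'd /var/run/postgresql/%s-%s.pg_stat_tmp 0700 postgres postgres - -\n' "$1" "$2"
}

# Show the entry that systemd-tmpfiles would need.
tmpfiles_entry "$PGVER" "$PGCLUSTER"

# Report whether the directory exists right now (read-only check).
if [ -d "/var/run/postgresql/${PGVER}-${PGCLUSTER}.pg_stat_tmp" ]; then
    echo "stat temp directory present"
else
    echo "stat temp directory missing"
fi
```

If the directory is missing after a reboot, put the printed entry in a file
under /etc/tmpfiles.d/ (for example postgresql-part.conf, a name I am assuming
here) and, as root, run "systemd-tmpfiles --create" on that file.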

[...]

> Now for the weird part, which is the workaround I inadvertently discovered
> when trying to figure out what was going wrong.  First, I do a `pcs cluster
> stop` again.  Then I start up the service using systemd, and it starts up
> just fine; so I do that and then shut it back down using `service
> postgresql@10-main start` and `service postgresql@10-main stop`, which should
> put it back in the original state.  Now, when I issue `pcs cluster start`,
> everything comes up just fine as expected with no errors!?!

Systemd uses the postgresql wrappers (see: man postgresql-common) to start your
cluster. A bunch of other actions take place there, so you cannot compare how
the Debian PostgreSQL wrapper behaves with how pgsqlms starts your cluster.

> I have also seen this "unknown error" come up at other undesirable times,
> like when doing a manual failover using `pcs cluster stop` on the primary or
> a `pcs resource move --master ...` command, however once the workaround is
> applied to the node having issues, it works perfectly fine until it's
> rebooted.
> 
> Can anyone explain what is happening here, and how I can fix it properly?

Make sure systemd sees PostgreSQL as stopped, and disable the unit so that
systemd does not start it by itself.
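
A minimal sketch of that check. The unit name is an assumption taken from the
"service" commands you quoted, and needs_stop is only a helper name made up for
this example. It only reads the state and changes nothing.

```shell
#!/bin/sh
# ASSUMPTION: the instance is managed by the systemd unit postgresql@10-main.
UNIT="postgresql@10-main"

# Return success when the state reported by "systemctl is-active" ($1)
# means systemd still considers the unit up or coming up.
needs_stop() {
    case "$1" in
        active|activating|reloading) return 0 ;;
        *) return 1 ;;
    esac
}

if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-active "$UNIT" 2>/dev/null)
    echo "$UNIT is-active: ${state:-unknown}"
    if needs_stop "$state"; then
        echo "still running under systemd; as root: systemctl stop $UNIT"
    fi
    echo "$UNIT is-enabled: $(systemctl is-enabled "$UNIT" 2>/dev/null)"
fi
```

If it reports the unit as enabled, "systemctl disable postgresql@10-main" as
root should take it out of systemd's hands, leaving only Pacemaker to start it.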

Try to start your PostgreSQL using these commands:

  sudo -iu postgres
  /usr/lib/postgresql/10/bin/pg_ctl --pgdata /var/lib/postgresql/10/main \
    -w --timeout 60 start

Then report here any errors you get.

If it starts, report that as well, then stop it using:

  sudo -iu postgres
  /usr/lib/postgresql/10/bin/pg_ctl --pgdata /var/lib/postgresql/10/main \
    -w --timeout 60 -m fast stop