[ClusterLabs] Cluster fails to start on rebooted nodes without manual fiddling...
Jehan-Guillaume de Rorthais
jgdr at dalibo.com
Tue Apr 3 08:58:29 EDT 2018
On Mon, 02 Apr 2018 14:05:09 -0600
Casey & Gina <caseyandgina at icloud.com> wrote:
[...]
> Now, if I restart the second node, and execute `pcs cluster start` once it's
> back up, it fails to start the resource and shows me this in the `pcs status`
> output:
>
>
> ------
> * postgresql-10-main_start_0 on d-gp2-dbp62-2 'unknown error' (1): call=11,
> status=complete, exitreason='Instance "postgresql-10-main" failed to start
> (rc: 1)', last-rc-change='Mon Apr 2 19:50:40 2018', queued=0ms, exec=228ms
> ------
When the cluster was down on "d-gp2-dbp62-2", was PostgreSQL stopped as well on
this node?
[...]
> It tells me to examine the PostgreSQL log output, so I look there, but I
> don't see anything logged at all since the server shutdown.
I suppose that is because either:
* the error comes from pg_ctl before it was able to actually start PostgreSQL...
* ...or the error was raised before PostgreSQL was able to set up its logging
behavior
In both cases, something is failing at a very early stage.
Something comes to mind: did you set up "systemd-tmpfiles" as explained at the
end of the following chapter?
https://clusterlabs.github.io/PAF/Quick_Start-Debian-9-pcs.html#postgresql-and-cluster-stack-installation
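For reference, here is a minimal sketch of what that setup amounts to. It
assumes your postgresql.conf keeps the Debian default "stats_temp_directory"
(/var/run/postgresql/10-main.pg_stat_tmp), so please check that setting and
the chapter above before relying on it. The helper name "tmpfiles_entry" is
only illustrative. As far as I know, /var/run lives on a tmpfs, so this
directory is gone after a reboot; the Debian wrapper recreates it, but a bare
pg_ctl does not.

```shell
# Hypothetical helper: print the tmpfiles.d entry for one Debian cluster.
# $1 = PostgreSQL major version, $2 = cluster name.
# ASSUMPTION: stats_temp_directory uses the Debian default location.
tmpfiles_entry() {
    printf 'd /var/run/postgresql/%s-%s.pg_stat_tmp 0700 postgres postgres - -\n' "$1" "$2"
}

# Show the entry for the cluster discussed in this thread.
tmpfiles_entry 10 main
```

As root, you would write that line to a file under /etc/tmpfiles.d/ (for
example postgresql-part.conf, an arbitrary name) and then run
`systemd-tmpfiles --create` on that file so the directory exists right away
and after each reboot.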
[...]
> Now for the weird part, which is the workaround I inadvertently discovered
> when trying to figure out what was going wrong. First, I do a `pcs cluster
> stop` again. Then I start up the service using systemd, and it starts up
> just fine; so I do that and then shut it back down using `service
> postgresql@10-main start` and `service postgresql@10-main stop`, which should
> put it back in the original state. Now, when I issue `pcs cluster start`,
> everything comes up just fine as expected with no errors!?!
Systemd uses the postgresql wrappers (see: man postgresql-common) to start your
cluster. A bunch of other actions take place there. So you cannot compare how
PostgreSQL's Debian wrapper behaves with how pgsqlms starts your cluster.
> I have also seen this "unknown error" come up at other undesirable times,
> like when doing a manual failover using `pcs cluster stop` on the primary or
> a `pcs resource move --master ...` command, however once the workaround is
> applied to the node having issues, it works perfectly fine until it's
> rebooted.
>
> Can anyone explain what is happening here, and how I can fix it properly?
Make sure Systemd sees PostgreSQL as stopped (and disable it).
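A rough sketch of how to check this, assuming the instance is managed by a
unit named postgresql@10-main (the naming Debian's postgresql-common uses for
version 10, cluster "main"). The helper names "pg_unit" and "pg_unit_state"
are made up for this example, and the check itself changes nothing:

```shell
# Hypothetical helper: build the systemd unit name for a Debian cluster.
# $1 = PostgreSQL major version, $2 = cluster name.
pg_unit() {
    printf 'postgresql@%s-%s' "$1" "$2"
}

# Hypothetical helper: print the state systemd reports for that unit
# (for example "active" or "inactive"). Read-only, no root needed.
pg_unit_state() {
    systemctl is-active "$(pg_unit "$1" "$2")"
}

echo "unit to check: $(pg_unit 10 main)"
```

Calling `pg_unit_state 10 main` should print "inactive" before you hand the
instance over to the cluster. If it does not, run `systemctl stop` and
`systemctl disable` on that unit as root.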
Try to start your PostgreSQL using these commands:

  sudo -iu postgres
  /usr/lib/postgresql/10/bin/pg_ctl --pgdata /var/lib/postgresql/10/main \
    -w --timeout 60 start

Then report here any errors you find.
If it starts, report that as well, but stop it using:

  sudo -iu postgres
  /usr/lib/postgresql/10/bin/pg_ctl --pgdata /var/lib/postgresql/10/main \
    -w --timeout 60 -m fast stop