[ClusterLabs] Replicated PGSQL woes

Fri Oct 14 09:39:44 UTC 2016

On Thu, 13 Oct 2016 14:11:06 -0800
Israel Brewster <israel at ravnalaska.net> wrote:

> On Oct 13, 2016, at 1:56 PM, Jehan-Guillaume de Rorthais <jgdr at dalibo.com>
> wrote:
> > 
> > On Thu, 13 Oct 2016 10:05:33 -0800
> > Israel Brewster <israel at ravnalaska.net> wrote:
> >   
> >> On Oct 13, 2016, at 9:41 AM, Ken Gaillot <kgaillot at redhat.com> wrote:  
> >>> 
> >>> On 10/13/2016 12:04 PM, Israel Brewster wrote:    
> > [...]
> >   
> >>>> But whatever- this is a cluster, it doesn't really matter which node
> >>>> things are running on, as long as they are running. So the cluster is
> >>>> working - postgresql starts, the master process is on the same node as
> >>>> the IP, you can connect, etc, everything looks good. Obviously the next
> >>>> thing to try is failover - should the master node fail, the slave node
> >>>> should be promoted to master. So I try testing this by shutting down the
> >>>> cluster on the primary server: "pcs cluster stop"
> >>>> ...and nothing happens. The master shuts down (uncleanly, I might add -
> >>>> it leaves behind a lock file that prevents it from starting again until
> >>>> I manually remove said lock file), but the slave is never promoted to    
> >>> 
> >>> This definitely needs to be corrected. What creates the lock file, and
> >>> how is that entity managed?    
> >> 
> >> The lock file entity is created/managed by the postgresql process itself.
> >> On launch, postgres creates the lock file to say it is running, and
> >> deletes said lock file when it shuts down. To my understanding, its role
> >> in life is to prevent a restart after an unclean shutdown so the admin is
> >> reminded to make sure that the data is in a consistent state before
> >> starting the server again.  
> > 
> > What is the name of this lock file? Where is it?
> > 
> > PostgreSQL does not create lock file. It creates a "postmaster.pid" file,
> > but it does not forbid a startup if the new process doesn't find another
> > process with the pid and shm shown in the postmaster.pid.
> > 
> > As far as I know, the pgsql resource agent create such a lock file on
> > promote and delete it on graceful stop. If the PostgreSQL instance couldn't
> > be stopped correctly, the lock files stays and the RA refuse to start it
> > the next time.  
> 
> Ah, you're right. Looking auth the RA I see where it creates the file in
> question. The delete appears to be in the pgsql_real_stop() function (which
> makes sense), wrapped in an if block that checks for $1 being master and
> $OCF_RESKEY_CRM_meta_notify_slave_uname being a space. Throwing a little
> debugging code in there I see that when it hits that block on a cluster stop,
> $OCF_RESKEY_CRM_meta_notify_slave_uname is centtest1.ravnalaska.net
> <http://centtest1.ravnalaska.net/>, not a space, so the lock file is not
> removed:
> 
>     if  [ "$1" = "master" -a "$OCF_RESKEY_CRM_meta_notify_slave_uname" = " "
> ]; then ocf_log info "Removing $PGSQL_LOCK."
>         rm -f $PGSQL_LOCK
>     fi 
> 
> It doesn't look like there is anywhere else where the file would be removed.

This is quite wrong to me for two reasons (I'll try to be clear):

1) the resource agent (RA) make sure the timeline (TL) will not be incremented
during promotion.

As there is no documentation about that, I'm pretty sure this contortion comes
from limitations in very old versions of PostgreSQL (<= 9.1):

  * a slave wasn't able to cross a timeline (TL) from streaming replication,
    only from WAL archives. That means crossing a TL was requiring to restart
    the slave or cutting the streaming rep temporary to force it to get back to
    the archives
  * moreover, it was possible a standby miss some transactions on after a clean
    master shutdown. That means the old master couldn't get back to the
    cluster as a slave safely, as the TL is still the same...

See slide 35->37: http://www.slideshare.net/takmatsuo/2012929-pg-study-16012253

In my understanding, that's why we make sure there's no slave around before
shutting down the master: should the master go back later cleanly, we make sure
no one could be promoted in the meantime.

Note that considering this issue and how the RA tries to avoid it, this test on
slave being shutdown before master is quite weak anyway...

Last but not least, the two PostgreSQL limitations the RA is messing with have
been fixed long time ago in 9.3:
  * https://www.postgresql.org/docs/current/static/release-9-3.html#AEN138909
  *
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=985bd7d49726c9f178558491d31a570d47340459

...but it requires PostgreSQL 9.3+ for the timeline issue. By the way, I suspect
this is related to the "restart_on_promote" parameter of the RA.

2) from a recent discussion on this list (or maybe on -dev), RA devs should not
rely on OCF_RESKEY_CRM_meta_notify_* vars outside of "notify" actions.

> > [...]  
> >>>> What can I do to fix this? What troubleshooting steps can I follow?
> >>>> Thanks.  
> > 
> > I can not find the result of the stop operation in your log files, maybe the
> > log from CentTest2 would be more useful.  
> 
> Sure. I was looking at centtest1 because I was trying to figure out why it
> wouldn't promote, but if centtest2 never really stopped (properly) that could
> explain things. Here's the log from 2 when calling pcs cluster stop:
> 
> [log log log]

Well tis is a normal shutdown and the master was shutdown cleanly. AS you
pointed out, the lock file stayed there because some slaves were still up.

I **guess** if you really want a shutdown to occurs, you need to simulate a real
failure, not shutting down the first node cleanly. Try to kill corosync.

> > but I can find this:
> > 
> >  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
> >  centtest2.ravnalaska.net for shutdown
> >  ...
> >  Oct 13 08:29:41 CentTest1 pengine[30095]:   notice: Scheduling Node
> >  centtest2.ravnalaska.net for shutdown
> > 
> > Which means the stop operation probably raised an error, leading to a
> > fencing of the node. In this circumstance, I bet PostgreSQL wasn't able to
> > stop correctly and the lock file stayed in place.
> > 
> > Could you please show us your full cluster setup?  
> 
> Sure: how? pcs status shows this, but I suspect that's not what you are
> asking about:

"pcs config" would do the trick.

-- 
Jehan-Guillaume de Rorthais
Dalibo