[ClusterLabs] 2-node cluster Postgresql problem with lock file

Andrei Borzenkov arvidjaar at gmail.com
Fri Nov 13 09:25:57 UTC 2015


On Thu, Nov 12, 2015 at 11:54 AM, Damien Bras <damien.bras at homesend.com> wrote:
> Hi,
>
>
>
> I have a 2-node cluster using PostgreSQL synchronous streaming replication.
> I have no preference for the location of the master role.
>
> I followed this documentation:
> http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster, and the replication
> works great.
>
>
>
> I just have a question about the PGSQL.lock file.
>
>
>
> When the master (node A) crashes, the resources switch to the hot standby
> slave (node B), which is OK. I can create records in the DB on node B.
>
> But when the old master (node A) restarts, I get the error “My data may be
> inconsistent” because the PGSQL.lock file is still present on this node.
>
> I don’t understand why this file is not deleted on start when the master
> role is on another node of the cluster.
>
>
>
> In my mind, I’d like it to work like this: when A crashes, the master role
> switches to the second node (B), which has the latest data because the
> replication is synchronous, so it becomes the reference. Then, when node A
> restarts, pacemaker makes it a slave (hot standby), activates the replication,
> and that’s all. Later, if B crashes, the master role switches to A, etc.
>
>
>
> When I manually delete PGSQL.lock, everything works: the slave
> synchronizes with the master.
>
>
>
> Is there a way to do that automatically? Is there a function in the pgsql
> RA that lets pacemaker delete this lock file when the master role is already
> on another node in the cluster?
>


I am not familiar with this specific RA, but in general: while
failover should be automatic as long as it is possible at all (after
all, that is why we have a cluster in the first place), failback should be
done under administrator supervision. Only after the administrator
has determined why the failover took place and, most importantly,
whether it is safe to destroy the old data, can failback be enabled and
performed.

As such, what you describe can be considered a feature, not a bug. It
guards against accidental failback by forcing the administrator to
explicitly remove the lock file.
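
To illustrate the kind of guard I mean, here is a minimal sketch in Python
of a supervised removal step. It is not part of the pgsql RA; the function
name, its parameters and the demo file are hypothetical. The two flags stand
for checks the administrator has to make by hand: that the master role really
is running on the other node, and that the old data on this node may be
discarded.

```python
import os
import tempfile


def remove_stale_lock(lock_path, master_is_elsewhere, admin_confirmed):
    """Remove the lock file only if every precondition holds.

    Returns True if the file was removed, False otherwise.
    Hypothetical helper for illustration, not an RA function.
    """
    # The administrator must have explicitly agreed to discard old data.
    if not admin_confirmed:
        return False
    # Never touch the lock unless the master role is on another node.
    if not master_is_elsewhere:
        return False
    # Nothing to do if the lock is already gone.
    if not os.path.exists(lock_path):
        return False
    os.remove(lock_path)
    return True


# Demo against a throwaway file, not a real cluster node.
demo_dir = tempfile.mkdtemp()
demo_lock = os.path.join(demo_dir, "PGSQL.lock")
open(demo_lock, "w").close()

# Without confirmation the lock stays in place.
print(remove_stale_lock(demo_lock, master_is_elsewhere=True, admin_confirmed=False))
# With both conditions met the lock is removed.
print(remove_stale_lock(demo_lock, master_is_elsewhere=True, admin_confirmed=True))
```

On a real node the path would be wherever your RA configuration places
PGSQL.lock. The point is only that the removal stays an explicit, confirmed
action and does not happen silently on start.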

Of course I do not know whether this was intentional. But looking at
the source, I tend to say it was.



