[Pacemaker] PostgreSQL failed to stop after streaming replication established
Mistina Michal
Michal.Mistina at virte.sk
Tue Aug 27 08:15:06 EDT 2013
Hi all.
I formatted the DRBD disk to get rid of the corrupted postmaster.pid file.
Since then everything has worked fine, and I have not been able to reproduce
the issue.
Best regards,
Michal Mistina
From: Mistina Michal [mailto:Michal.Mistina at virte.sk]
Sent: Monday, August 19, 2013 9:39 AM
To: The Pacemaker cluster resource manager
Subject: [Pacemaker] PostgreSQL failed to stop after streaming replication established
Dear community.
The redundant environment is shown in the diagram below.
             +------------------------------------+
             |                WAN                 |
             +                                    v
+------------+------------+          +------------+------------+
|pgsql       |pgsql       |          |pgsql       |pgsql       |
+------------+------------+          +------------+------------+
|drbd-pri    |drbd-sec    |          |drbd-pri    |drbd-sec    |
+------------+------------+          +------------+------------+
|        pacemaker        |          |        pacemaker        |
+-------------------------+          +-------------------------+
|        corosync         |          |        corosync         |
+------------+------------+          +------------+------------+
|node1       |node2       |          |node1       |node2       |
+------------+------------+          +------------+------------+
            TC1                                  TC2
Within each technical center (TC), migrating resources between nodes worked
fine.
Then I set up streaming replication from TC1 to TC2.
Now migration from one node to another fails: the Pacemaker stop operation on
the postgres resource reports FAILED. PostgreSQL itself was stopped, but
postmaster.pid was left behind in a corrupted state.
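To tell "PostgreSQL is still running" apart from "only a leftover pid file
remains", the pid file can be inspected directly. The following is a minimal
sketch, not part of the init script or any resource agent: pidfile_state is a
hypothetical helper name, it relies on the first line of postmaster.pid holding
the postmaster's PID, and it assumes it is run as root or postgres so that
kill -0 is allowed to probe the process.

```shell
#!/bin/sh
# pidfile_state PIDFILE  (hypothetical helper, for illustration only)
# Prints one of:
#   absent      the path cannot be stat'd (missing, or a broken directory entry)
#   unreadable  the first line is not a plain decimal PID
#   running     a process with that PID exists
#   stale       no process with that PID exists
pidfile_state() {
    pidfile=$1
    if [ ! -e "$pidfile" ]; then
        echo absent
        return 0
    fi
    # PostgreSQL writes the postmaster PID on the first line of the file.
    pid=$(head -n 1 "$pidfile" 2>/dev/null)
    case $pid in
        ''|*[!0-9]*)
            echo unreadable
            return 0
            ;;
    esac
    # kill -0 sends no signal; it only checks that the process can be signalled.
    # Assumption: the caller is root or the postgres user.
    if kill -0 "$pid" 2>/dev/null; then
        echo running
    else
        echo stale
    fi
}
```

For example, pidfile_state /var/lib/pgsql/9.2/data/postmaster.pid prints a
single word. Note that "absent" is also what a pid file with a damaged
directory entry would report, since it cannot be stat'd.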
This is where I have ended up.
I am unable to stop the postgresql service correctly on TC1 (the streaming
replication master). After issuing /etc/init.d/postgresql-9.2 stop,
postmaster.pid remains on the filesystem and, moreover, it is corrupted. I am
unable to delete it with the rm command.
It looks like this:
[root@pcmk1 ~]# ll /var/lib/pgsql/9.2/data/
ls: cannot access /var/lib/pgsql/9.2/data/postmaster.pid: No such file or directory
total 56
drwx------ 7 postgres postgres 62 Jun 26 17:13 base
drwx------ 2 postgres postgres 4096 Aug 18 00:25 global
drwx------ 2 postgres postgres 17 Jun 26 09:54 pg_clog
-rw------- 1 postgres postgres 5127 Aug 17 16:24 pg_hba.conf
-rw------- 1 postgres postgres 1636 Jun 26 09:54 pg_ident.conf
drwx------ 2 postgres postgres 4096 Jul 2 00:00 pg_log
drwx------ 4 postgres postgres 34 Jun 26 09:53 pg_multixact
drwx------ 2 postgres postgres 17 Aug 18 00:23 pg_notify
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_serial
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_snapshots
drwx------ 2 postgres postgres 6 Aug 18 00:25 pg_stat_tmp
drwx------ 2 postgres postgres 17 Jun 26 09:54 pg_subtrans
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_tblspc
drwx------ 2 postgres postgres 6 Jun 26 09:53 pg_twophase
-rw------- 1 postgres postgres 4 Jun 26 09:53 PG_VERSION
drwx------ 3 postgres postgres 4096 Aug 18 00:25 pg_xlog
-rw------- 1 postgres postgres 19884 Aug 17 22:54 postgresql.conf
-rw------- 1 postgres postgres 71 Aug 18 00:23 postmaster.opts
?????????? ? ? ? ? ? postmaster.pid
-rw-r--r-- 1 postgres postgres 491 Aug 17 16:33 recovery.done
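The row of question marks means that ls could read the name postmaster.pid
from the directory but could not stat the entry itself. A minimal sketch for
finding such entries in a directory follows. list_unreadable_entries is a
hypothetical helper name, and the sketch assumes file names without embedded
newlines.

```shell
#!/bin/sh
# list_unreadable_entries DIR  (hypothetical helper, for illustration only)
# Prints every entry that appears in DIR's listing but whose metadata
# cannot be read, i.e. the entries that "ls -l" shows as question marks.
list_unreadable_entries() {
    dir=$1
    # "ls -A" only needs the names stored in the directory itself.
    ls -A -- "$dir" 2>/dev/null | while IFS= read -r name; do
        # "ls -ld" has to stat the entry; dangling symlinks still succeed
        # because the link itself can be stat'd.
        if ! ls -ld -- "$dir/$name" >/dev/null 2>&1; then
            printf '%s\n' "$dir/$name"
        fi
    done
}
```

For example, list_unreadable_entries /var/lib/pgsql/9.2/data prints nothing on
a healthy data directory and one path per damaged entry otherwise.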
I don't know whether the resource agent did something wrong while Pacemaker
was stopping postgres, or whether PostgreSQL itself is the component that
failed to stop correctly. What do you think? Has anybody experienced a
problem like this?
I am using:
- pacemaker-1.1.7-6
- corosync-1.4.1-7
- resource-agents-3.9.2-12
- drbd-8.4.3-2
CONFIGURATION
[root@pcmk2 9.2]# crm configure show
node pcmk1 \
        attributes standby="off"
node pcmk2 \
        attributes standby="off"
primitive drbd_pg ocf:linbit:drbd \
        params drbd_resource="postgres" \
        op monitor interval="15" role="Master" \
        op monitor interval="16" role="Slave" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="120"
primitive pg_fs ocf:heartbeat:Filesystem \
        params device="/dev/vg_local-lv_pgsql/lv_pgsql" directory="/var/lib/pgsql/9.2/data" options="noatime,nodiratime" fstype="xfs" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="120"
primitive pg_lsb lsb:postgresql-9.2 \
        op monitor interval="30" timeout="60" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive pg_lvm ocf:heartbeat:LVM \
        params volgrpname="vg_local-lv_pgsql" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="30"
primitive pg_vip ocf:heartbeat:IPaddr2 \
        params ip="x.x.x.x" iflabel="pcmkvip" \
        op monitor interval="5"
group PGServer pg_lvm pg_fs pg_lsb pg_vip \
        meta target-role="Started"
ms ms_drbd_pg drbd_pg \
        meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Started"
location master-prefer-node1 pg_vip 50: pcmk1
colocation col_pg_drbd inf: PGServer ms_drbd_pg:Master
order ord_pg inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-6.el6-148fccfd5985c5590cc601123c6c16e966b85d14" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="4" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        maintenance-mode="true" \
        last-lrm-refresh="1376753310"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"
Best regards,
Michal Mistina