[ClusterLabs] killing postgres wal progress in slave makes cluster crashed
范国腾
fanguoteng at highgo.com
Sat Apr 28 04:46:31 EDT 2018
Hi,
There are three nodes: node1, node2, and node3. node1 is the master; node2 and node3 are slaves.
We ran "truncate table" at 14:25:36 and then killed the WAL process on db2. After that, Pacemaker on db2 went down and db1 was rebooted.
However, I could not find any relevant information in /var/log/messages. The logs follow; could you help find any clue?
Current DC: db3 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Sat Apr 28 16:37:53 2018 Last change: Sat Apr 28 16:02:25 2018 by hacluster via crmd on db3
3 nodes and 19 resources configured
Node db2: pending
Online: [ db3 ]
OFFLINE: [ db1 ]
Full list of resources:
ipmi_node1 (stonith:fence_ipmilan): Started db3
ipmi_node2 (stonith:fence_ipmilan): Started db3
ipmi_node3 (stonith:fence_ipmilan): Stopped
Clone Set: dlm-clone [dlm]
Started: [ db3 ]
Stopped: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
Started: [ db3 ]
Stopped: [ db1 db2 ]
Clone Set: clusterfs-clone [clusterfs]
Started: [ db3 ]
Stopped: [ db1 db2 ]
Master/Slave Set: pgsql-ha [pgsqld]
Masters: [ db3 ]
Stopped: [ db1 db2 ]
Resource Group: mastergroup
master-vip (ocf::heartbeat:IPaddr2): Started db3
rep-vip (ocf::heartbeat:IPaddr2): Started db3
slave1-vip (ocf::heartbeat:IPaddr2): Stopped
slave2-vip (ocf::heartbeat:IPaddr2): Stopped
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
DB1 /var/log/messages
[cid:image005.png at 01D3DF0E.913D82A0]
DB2 /var/log/messages
[cid:image003.png at 01D3DF0F.335AF9A0]
DB1 postgres log
[cid:image001.png at 01D3DF06.BECE3820]
DB2 postgres log
[cid:image004.png at 01D3DF09.1C243130]
From: 徐晓菲
Sent: April 27, 2018 9:57
To: 邵大明 <shaodaming at highgo.com>; 范国腾 <fanguoteng at highgo.com>; 王亮 <wangliang at highgo.com>
Subject: Re: Re: message+pglog
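OK, got it.
As for the problem whose logs I sent by email yesterday, I am not sure whether it is related to truncating the table, because, like the case below, it also involved a truncate.
Here is another case:
Steps:
(1) 1 master and 2 standbys (db1 master, db2 standby, db3 standby), psql -h master-vip
(2) create tb1; an insert into tb1 is in progress
(3) kill the streaming replication process on one standby (db3)
(4) that standby restarts its streaming replication process; pcs status still shows the original 1 master and 2 standbys
(5) truncate tb1
(6) kill the streaming replication process on one standby (db3) again (no insert running)
(7) the original master db1 gets powered off
(8) run pcs status and check the processes on db2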
[root at sds2 ~]# pcs status
Cluster name: hgpurog
Stack: corosync
Current DC: db2 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Fri Apr 27 09:47:15 2018 Last change: Fri Apr 27 09:28:01 2018 by root via crm_attribute on db1
3 nodes and 19 resources configured
Node db3: pending
Online: [ db2 ]
OFFLINE: [ db1 ]
Full list of resources:
ipmi_node1 (stonith:fence_ipmilan): Started db2
ipmi_node2 (stonith:fence_ipmilan): Stopped
ipmi_node3 (stonith:fence_ipmilan): Started db2
Clone Set: dlm-clone [dlm]
Started: [ db2 ]
Stopped: [ db1 db3 ]
Clone Set: clvmd-clone [clvmd]
Started: [ db2 ]
Stopped: [ db1 db3 ]
Clone Set: clusterfs-clone [clusterfs]
Started: [ db2 ]
Stopped: [ db1 db3 ]
Master/Slave Set: pgsql-ha [pgsqld]
Slaves: [ db2 ]
Stopped: [ db1 db3 ]
Resource Group: mastergroup
master-vip (ocf::heartbeat:IPaddr2): Stopped
rep-vip (ocf::heartbeat:IPaddr2): Stopped
slave1-vip (ocf::heartbeat:IPaddr2): Stopped
slave2-vip (ocf::heartbeat:IPaddr2): Stopped
Failed Actions:
* pgsqld_promote_0 on db2 'unknown error' (1): call=94, status=Timed Out, exitreason='none',
last-rc-change='Fri Apr 27 09:36:06 2018', queued=0ms, exec=300002ms
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
[root at sds2 ~]#
[highgo at sds2 data]$ ps -ef|grep postgres
highgo 29499 28255 0 09:51 pts/1 00:00:00 grep --color=auto postgres
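The failed promote above reports exec=300002ms, which is about 300 seconds before the operation was marked Timed Out. Below is a minimal sketch for pulling such entries out of saved pcs status text. The function name and the regular expressions are assumptions based only on the line format shown in this output, and may need adjusting for other Pacemaker versions.

```python
import re

# Patterns derived only from the "Failed Actions" entry shown above.
FAILED_RE = re.compile(
    r"\*\s+(?P<op>\S+) on (?P<node>\S+) '(?P<reason>[^']*)' \((?P<rc>\d+)\):"
    r".*?status=(?P<status>[^,]+),"
)
EXEC_RE = re.compile(r"exec=(?P<ms>\d+)ms")

SAMPLE = (
    "* pgsqld_promote_0 on db2 'unknown error' (1): call=94, "
    "status=Timed Out, exitreason='none',\n"
    "last-rc-change='Fri Apr 27 09:36:06 2018', queued=0ms, exec=300002ms\n"
)


def failed_actions(status_text):
    """Return one dict per failed action found in pcs status text."""
    results = []
    for m in FAILED_RE.finditer(status_text):
        # The exec time sits on the continuation line, so search after the match.
        exec_m = EXEC_RE.search(status_text, m.end())
        results.append({
            "op": m.group("op"),
            "node": m.group("node"),
            "status": m.group("status"),
            "exec_s": int(exec_m.group("ms")) / 1000 if exec_m else None,
        })
    return results


for action in failed_actions(SAMPLE):
    print(action["op"], action["node"], action["status"], action["exec_s"])
```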
pcs status and process check on db3
[root at sds3 ~]# pcs status
Error: cluster is not currently running on this node
[root at sds3 ~]# ps -ef |grep postgres
highgo 4388 1 0 09:19 ? 00:00:13 /home/highgo/hgdb/bin/postgres -D /home/highgo/hgdb/data
highgo 4449 4388 0 09:19 ? 00:00:00 postgres: logger process
highgo 10723 4388 2 09:28 ? 00:00:35 postgres: startup process recovering 0000000900000000000000EF
highgo 10732 4388 0 09:28 ? 00:00:00 postgres: checkpointer process
highgo 10733 4388 0 09:28 ? 00:00:00 postgres: writer process
highgo 11261 4388 0 09:28 ? 00:00:00 postgres: stats collector process
highgo 12229 4388 0 09:50 ? 00:00:00 postgres: wal receiver process
root 12231 17313 0 09:50 pts/0 00:00:00 grep --color=auto postgres
[root at sds3 ~]#
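To make the comparison between db2 (no postgres processes) and db3 (wal receiver running again) easier to repeat, here is a minimal sketch that extracts the PostgreSQL process roles from ps -ef output. The function names are hypothetical, and matching on the title "wal receiver" assumes the process naming shown in the listing above.

```python
PS_SAMPLE = """\
highgo 4388 1 0 09:19 ? 00:00:13 /home/highgo/hgdb/bin/postgres -D /home/highgo/hgdb/data
highgo 4449 4388 0 09:19 ? 00:00:00 postgres: logger process
highgo 12229 4388 0 09:50 ? 00:00:00 postgres: wal receiver process
root 12231 17313 0 09:50 pts/0 00:00:00 grep --color=auto postgres
"""


def postgres_roles(ps_output):
    """Return the process titles that follow 'postgres: ' in ps -ef output."""
    roles = []
    for line in ps_output.splitlines():
        # The postmaster line and the grep line carry no 'postgres: ' title.
        _, sep, title = line.partition("postgres: ")
        if sep:
            roles.append(title.strip())
    return roles


def has_wal_receiver(ps_output):
    """True if a wal receiver process appears in the ps output."""
    return any(role.startswith("wal receiver") for role in postgres_roles(ps_output))


print(postgres_roles(PS_SAMPLE))
print(has_wal_receiver(PS_SAMPLE))
```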
________________________________
Wishing you success in your work!
----------------------------------
徐晓菲, Product Testing Department
Highgo Software Co., Ltd.
Website: www.highgo.com<http://www.highgo.com/>
Address: 20th Floor, Mingsheng Building, No. 2117 Xinluo Street, High-tech Zone, Jinan
Mobile: 183-6307-3951 Email: xuxiaofei at highgo.com<mailto:wanghui at highgo.com>
From: shaodaming at highgo.com<mailto:shaodaming at highgo.com>
Sent: 2018-04-27 09:37
To: xuxiaofei at highgo.com<mailto:xuxiaofei at highgo.com>; fanguoteng<mailto:fanguoteng at highgo.com>; 王亮<mailto:wangliang at highgo.com>
Subject: Re: Re: message+pglog
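Hi Xiaofei,
Crossover means the following, assuming two machines act as client servers.
One machine opens 400 clients that access database 1 on standby 1.
The other machine opens 400 clients that access database 2 on standby 2.
10% crossover means 360 clients access database 1 on standby 1 and 40 access database 2 on standby 1,
and likewise 360 clients access database 2 on standby 2 and 40 access database 1 on standby 2.
Other cases scale proportionally in the same way.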
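As a rough illustration of the proportions above, the split for any crossover percentage can be computed like this. The function name is hypothetical, and rounding to the nearest whole client is an assumption, since only the 10% case is spelled out here.

```python
from typing import Tuple


def crossover_split(clients: int, crossover_pct: float) -> Tuple[int, int]:
    """Split one machine's clients between its own database and the other one.

    Returns (own_db_clients, other_db_clients). Rounding to the nearest
    whole client is an assumption; only the 10% case is given explicitly.
    """
    crossed = round(clients * crossover_pct / 100)
    return clients - crossed, crossed


# The 10% case described above: 400 clients per machine.
own, other = crossover_split(400, 10)
print(own, other)  # 360 40
```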
thanks.
Br.
Bret
________________________________
shaodaming at highgo.com<mailto:shaodaming at highgo.com>
From: xuxiaofei at highgo.com<mailto:xuxiaofei at highgo.com>
Sent: 2018-04-27 09:18
To: 范国腾<mailto:fanguoteng at highgo.com>; wangliang<mailto:wangliang at highgo.com>; shaodaming<mailto:shaodaming at highgo.com>
Subject: Re: message+pglog
Hello,
Does "crossover" here mean, for example, that 100% crossover is sending SELECTs at the same time, and 10% crossover is standby 1 reading for a while and then standby 2 reading?
From: xuxiaofei at highgo.com<mailto:xuxiaofei at highgo.com>
Sent: 2018-04-26 16:33
To: 范国腾<mailto:fanguoteng at highgo.com>; wangliang<mailto:wangliang at highgo.com>; shaodaming<mailto:shaodaming at highgo.com>
Subject: message+pglog
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 10544 bytes
Desc: image001.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 18486 bytes
Desc: image002.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 64127 bytes
Desc: image004.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0007.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image005.png
Type: image/png
Size: 55884 bytes
Desc: image005.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0008.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 63363 bytes
Desc: image003.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0009.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.rar
Type: application/octet-stream
Size: 466135 bytes
Desc: messages.rar
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0001.obj>