[ClusterLabs] killing postgres wal progress in slave makes cluster crashed

Sat Apr 28 04:46:31 EDT 2018

Hi,
There are three nodes: node1, node2 and node3. node1 is the master; node2 and node3 are slaves.
We executed "truncate table" at 14:25:36 and killed the WAL process on db2. After that, pacemaker on db2 went down and db1 rebooted.
I could not find any relevant information in /var/log/messages. The following is the log; could you help me find any clue?

Current DC: db3 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Sat Apr 28 16:37:53 2018          Last change: Sat Apr 28 16:02:25 2018 by hacluster via crmd on db3

3 nodes and 19 resources configured

Node db2: pending
Online: [ db3 ]
OFFLINE: [ db1 ]

Full list of resources:

ipmi_node1     (stonith:fence_ipmilan):        Started db3
ipmi_node2     (stonith:fence_ipmilan):        Started db3
ipmi_node3     (stonith:fence_ipmilan):        Stopped
Clone Set: dlm-clone [dlm]
     Started: [ db3 ]
     Stopped: [ db1 db2 ]
Clone Set: clvmd-clone [clvmd]
     Started: [ db3 ]
     Stopped: [ db1 db2 ]
Clone Set: clusterfs-clone [clusterfs]
     Started: [ db3 ]
     Stopped: [ db1 db2 ]
Master/Slave Set: pgsql-ha [pgsqld]
     Masters: [ db3 ]
     Stopped: [ db1 db2 ]
Resource Group: mastergroup
     master-vip (ocf::heartbeat:IPaddr2):       Started db3
     rep-vip    (ocf::heartbeat:IPaddr2):       Started db3
slave1-vip     (ocf::heartbeat:IPaddr2):       Stopped
slave2-vip     (ocf::heartbeat:IPaddr2):       Stopped

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
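
For anyone who wants to watch for this state from a script, here is a minimal shell sketch that pulls the node membership lines out of `pcs status` text like the output above. The function name `node_states` is made up for illustration, and it only assumes the three line shapes seen in this output (`Node X: pending`, `Online: [ ... ]`, `OFFLINE: [ ... ]`); other pcs versions may print these lines differently.

```shell
# Summarise node membership from `pcs status` text read on stdin.
# Prints one "<node> <state>" line per node.
node_states() {
    awk '
        # e.g. "Node db2: pending"
        $1 == "Node" && NF == 3 { sub(/:$/, "", $2); print $2, $3 }
        # e.g. "Online: [ db3 ]"
        $1 == "Online:"  { for (i = 3; i < NF; i++) print $i, "online" }
        # e.g. "OFFLINE: [ db1 ]"
        $1 == "OFFLINE:" { for (i = 3; i < NF; i++) print $i, "offline" }
    '
}

# Usage: pcs status | node_states
```

Fed the status shown above, this should print `db2 pending`, `db3 online` and `db1 offline`, one per line.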

DB1 /var/log/messages
[screenshot: image005.png]

DB2 /var/log/messages
[screenshot: image003.png]

DB1 postgres log
[screenshot: image001.png]

DB2 postgres log
[screenshot: image004.png]

From: 徐晓菲
Sent: April 27, 2018 9:57
To: 邵大明 <shaodaming at highgo.com>; 范国腾 <fanguoteng at highgo.com>; 王亮 <wangliang at highgo.com>
Subject: Re: Re: message+pglog

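OK, got it.

Also, about the problem from yesterday's email with the logs: I am not sure whether it is related to truncating the table, because, like the case below, a truncate was run there too.

Here is another case.
Steps:
(1) 1 master and 2 standbys (db1 master, db2 standby, db3 standby); connect with psql -h master-vip
(2) create tb1; an insert into tb1 is running
(3) kill the streaming replication process on one standby (db3)
(4) that standby restarts its streaming replication process; pcs status still shows the original 1 master and 2 standbys

(5) truncate tb1
(6) kill the streaming replication process on the standby (db3) again (no insert is running this time)
(7) the original master db1 gets shut down
(8) run pcs status and check the processes on db2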
[root at sds2 ~]# pcs status
Cluster name: hgpurog
Stack: corosync
Current DC: db2 (version 1.1.15-11.el7-e174ec8) - partition with quorum
Last updated: Fri Apr 27 09:47:15 2018      Last change: Fri Apr 27 09:28:01 2018 by root via crm_attribute on db1

3 nodes and 19 resources configured

Node db3: pending
Online: [ db2 ]
OFFLINE: [ db1 ]

Full list of resources:

 ipmi_node1    (stonith:fence_ipmilan):    Started db2
 ipmi_node2    (stonith:fence_ipmilan):    Stopped
 ipmi_node3    (stonith:fence_ipmilan):    Started db2
 Clone Set: dlm-clone [dlm]
     Started: [ db2 ]
     Stopped: [ db1 db3 ]
 Clone Set: clvmd-clone [clvmd]
     Started: [ db2 ]
     Stopped: [ db1 db3 ]
 Clone Set: clusterfs-clone [clusterfs]
     Started: [ db2 ]
     Stopped: [ db1 db3 ]
 Master/Slave Set: pgsql-ha [pgsqld]
     Slaves: [ db2 ]
     Stopped: [ db1 db3 ]
 Resource Group: mastergroup
     master-vip  (ocf::heartbeat:IPaddr2):   Stopped
     rep-vip (ocf::heartbeat:IPaddr2):   Stopped
 slave1-vip    (ocf::heartbeat:IPaddr2):   Stopped
 slave2-vip    (ocf::heartbeat:IPaddr2):   Stopped

Failed Actions:
* pgsqld_promote_0 on db2 'unknown error' (1): call=94, status=Timed Out, exitreason='none',
    last-rc-change='Fri Apr 27 09:36:06 2018', queued=0ms, exec=300002ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root at sds2 ~]#

[highgo at sds2 data]$ ps -ef|grep postgres
highgo   29499 28255  0 09:51 pts/1    00:00:00 grep --color=auto postgres

pcs status and process list on db3:
[root at sds3 ~]# pcs status
Error: cluster is not currently running on this node

[root at sds3 ~]# ps -ef |grep postgres
highgo    4388     1  0 09:19 ?        00:00:13 /home/highgo/hgdb/bin/postgres -D /home/highgo/hgdb/data
highgo    4449  4388  0 09:19 ?        00:00:00 postgres: logger process
highgo   10723  4388  2 09:28 ?        00:00:35 postgres: startup process   recovering 0000000900000000000000EF
highgo   10732  4388  0 09:28 ?        00:00:00 postgres: checkpointer process
highgo   10733  4388  0 09:28 ?        00:00:00 postgres: writer process
highgo   11261  4388  0 09:28 ?        00:00:00 postgres: stats collector process
highgo   12229  4388  0 09:50 ?        00:00:00 postgres: wal receiver process
root     12231 17313  0 09:50 pts/0    00:00:00 grep --color=auto postgres
[root at sds3 ~]#
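
A side note on the startup process line in the db3 process list above: a PostgreSQL WAL segment file name is 24 hex digits, 8 for the timeline ID followed by two 8-digit fields that locate the segment. Here is a small sketch to decode one. The function name `decode_wal_name` and the field labels are made up for illustration, and it only checks the length, not that every character is a hex digit.

```shell
# Split a 24-character WAL segment file name into its three hex fields
# and print them as decimal numbers.
decode_wal_name() {
    name=$1
    if [ "${#name}" -ne 24 ]; then
        echo "expected a 24-character WAL file name" >&2
        return 1
    fi
    tli=$(printf '%s' "$name" | cut -c1-8)     # timeline ID
    log=$(printf '%s' "$name" | cut -c9-16)    # high part of the segment number
    seg=$(printf '%s' "$name" | cut -c17-24)   # low part of the segment number
    printf 'timeline=%d log=%d segment=%d\n' "0x$tli" "0x$log" "0x$seg"
}

# Usage: decode_wal_name 0000000900000000000000EF
# prints: timeline=9 log=0 segment=239
```

So the segment db3 is recovering belongs to timeline 9. Comparing that with the timeline on the other nodes may help when reading the postgres logs.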

________________________________

Best wishes for your work!
----------------------------------
徐晓菲, Product Testing Department
瀚高基础软件股份有限公司
Website: www.highgo.com <http://www.highgo.com/>
Address: 20th Floor, Mingsheng Building, No. 2117 Xinluo Street, High-tech Zone, Jinan
Mobile: 183-6307-3951  Email: xuxiaofei at highgo.com

From: shaodaming at highgo.com
Sent: 2018-04-27 09:37
To: xuxiaofei at highgo.com; fanguoteng <fanguoteng at highgo.com>; 王亮 <wangliang at highgo.com>
Subject: Re: Re: message+pglog
hi, xiaofei

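"Crossover" means the following. Suppose two machines act as client hosts.
One machine opens 400 clients that access database 1 on standby 1.
The other machine opens 400 clients that access database 2 on standby 2.
10% crossover means that 360 clients access database 1 on standby 1 and 40 access database 2 on standby 1;
likewise, 360 clients access database 2 on standby 2 and 40 access database 1 on standby 2.
Other percentages change the split proportionally in the same way.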
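
To make the arithmetic above concrete, here is a small sketch. The function name `crossover_split` is just for illustration; it assumes the percentage is a whole number and uses integer division, so odd totals round the crossing share down.

```shell
# Given the number of clients on one machine and a crossover percentage,
# print "<clients on the primary database> <clients on the other database>".
crossover_split() {
    total=$1
    pct=$2
    cross=$(( total * pct / 100 ))
    printf '%d %d\n' "$(( total - cross ))" "$cross"
}

# Usage: crossover_split 400 10
# prints: 360 40
```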

thanks.
Br.
Bret
________________________________
shaodaming at highgo.com

From: xuxiaofei at highgo.com
Sent: 2018-04-27 09:18
To: 范国腾 <fanguoteng at highgo.com>; wangliang <wangliang at highgo.com>; shaodaming <shaodaming at highgo.com>
Subject: Re: message+pglog
Hello,
    Does "crossover" here mean that, for example, 100% crossover is sending the selects at the same time, while 10% crossover is standby 1 reading for a while and then standby 2 reading?


From: xuxiaofei at highgo.com
Sent: 2018-04-26 16:33
To: 范国腾 <fanguoteng at highgo.com>; wangliang <wangliang at highgo.com>; shaodaming <shaodaming at highgo.com>
Subject: message+pglog


-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 10544 bytes
Desc: image001.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0005.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image002.png
Type: image/png
Size: 18486 bytes
Desc: image002.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0006.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image004.png
Type: image/png
Size: 64127 bytes
Desc: image004.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0007.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image005.png
Type: image/png
Size: 55884 bytes
Desc: image005.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0008.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image003.png
Type: image/png
Size: 63363 bytes
Desc: image003.png
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0009.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: messages.rar
Type: application/octet-stream
Size: 466135 bytes
Desc: messages.rar
URL: <https://lists.clusterlabs.org/pipermail/users/attachments/20180428/e6f11740/attachment-0001.obj>