[Pacemaker] Corosync and Pacemaker Hangs
Norbert Kiam Maclang
norbert.kiam.maclang at gmail.com
Thu Sep 11 02:09:45 UTC 2014
Hi,
Please help me understand what is causing this problem. I have a 2-node
cluster running on VMs under KVM. Each VM (Ubuntu 14.04) runs on a separate
hypervisor on a separate physical machine. Everything worked well during
testing (I restarted the VMs alternately), but after about a day, when I
kill the other node, corosync and pacemaker always hang on the surviving
node. Date and time on the VMs are in sync, I use unicast, tcpdump shows
traffic exchanged between both nodes, DRBD is healthy, and crm_mon shows a
good status right before I kill the other node (see the verification
commands after the version list below). These are my configurations and the
versions I am using:
corosync 2.3.3-1ubuntu1
crmsh 1.2.5+hg1034-1ubuntu3
drbd8-utils 2:8.4.4-1ubuntu1
libcorosync-common4 2.3.3-1ubuntu1
libcrmcluster4 1.1.10+git20130802-1ubuntu2
libcrmcommon3 1.1.10+git20130802-1ubuntu2
libcrmservice1 1.1.10+git20130802-1ubuntu2
pacemaker 1.1.10+git20130802-1ubuntu2
pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
postgresql-9.3 9.3.5-0ubuntu0.14.04.1
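For reference, these are roughly the checks I run on both nodes before
killing a VM (a minimal sketch; eth0 and port 5405 match the corosync
configuration below):

  # confirm the nodes exchange unicast corosync traffic
  tcpdump -ni eth0 udp port 5405
  # confirm DRBD is Connected and UpToDate/UpToDate
  cat /proc/drbd
  # one-shot cluster status
  crm_mon -1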
# /etc/corosync/corosync.conf:
totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    interface {
        member {
            memberaddr: 10.2.136.56
        }
        member {
            memberaddr: 10.2.136.57
        }
        ringnumber: 0
        bindnetaddr: 10.2.136.0
        mcastport: 5405
    }
    transport: udpu
}

amf {
    mode: disabled
}

quorum {
    provider: corosync_votequorum
    expected_votes: 1
}

aisexec {
    user: root
    group: root
}

logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}
# /etc/corosync/service.d/pcmk:
service {
    name: pacemaker
    ver: 1
}
# /etc/drbd.d/global_common.conf:
global {
    usage-count no;
}
common {
    net {
        protocol C;
    }
}
# /etc/drbd.d/pg.res:
resource pg {
    device /dev/drbd0;
    disk /dev/vdb;
    meta-disk internal;

    startup {
        wfc-timeout 15;
        degr-wfc-timeout 60;
    }
    disk {
        on-io-error detach;
        resync-rate 40M;
    }
    on node01 {
        address 10.2.136.56:7789;
    }
    on node02 {
        address 10.2.136.57:7789;
    }
    net {
        verify-alg md5;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }
}
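DRBD health is confirmed before each test with the drbd8 utilities listed
above (a minimal check):

  # cs: should be Connected, ds: should be UpToDate/UpToDate
  cat /proc/drbd
  drbd-overview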
# Pacemaker configuration:
node $id="167938104" node01
node $id="167938105" node02
primitive drbd_pg ocf:linbit:drbd \
    params drbd_resource="pg" \
    op monitor interval="29s" role="Master" \
    op monitor interval="31s" role="Slave"
primitive fs_pg ocf:heartbeat:Filesystem \
    params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" fstype="ext4"
primitive ip_pg ocf:heartbeat:IPaddr2 \
    params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
primitive lsb_pg lsb:postgresql
group PGServer fs_pg lsb_pg ip_pg
ms ms_drbd_pg drbd_pg \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.10-42f2063" \
    cluster-infrastructure="corosync" \
    stonith-enabled="false" \
    no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100"
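I load and verify this configuration with crmsh (a rough sketch; the file
name pg-cluster.crm is only an example):

  # apply the configuration above from a file
  crm configure load update pg-cluster.crm
  # verify what the cluster actually holds
  crm configure show
  crm_mon -1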
# Logs on node01
Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback: Our peer
on the DC is dead
Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State
transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition: State
transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
cause=C_FSA_INTERNAL origin=do_election_check ]
Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership (
10.2.136.56:52) was formed. Members left: 167938105
Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did not
arrive in time.
Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer( Primary ->
Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender terminated
Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating
drbd_a_pg
Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection closed
Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(
NetworkFailure -> Unconnected )
Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver terminated
Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting receiver
thread
Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver (re)started
Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn( Unconnected
-> WFConnection )
Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 8445) timed out
Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:8445 - timed out after 20000ms
Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM
operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback: Resource
update 23 failed: (rc=-62) Timer expired
Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 8693) timed out
Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:8693 - timed out after 20000ms
Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 8938) timed out
Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:8938 - timed out after 20000ms
Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped: Integration
Timer (I_INTEGRATED) just popped in state S_INTEGRATION! (180000ms)
Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition:
Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1 cluster
nodes failed to respond to the join offer.
Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1:
node02=none
Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log: join-1:
node01=welcomed
Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 9185) timed out
Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:9185 - timed out after 20000ms
Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 9432) timed out
Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:9432 - timed out after 20000ms
Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 9680) timed out
Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:9680 - timed out after 20000ms
Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 9927) timed out
Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:9927 - timed out after 20000ms
Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback:
drbd_pg_monitor_31000 process (PID 10174) timed out
Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished:
drbd_pg_monitor_31000:10174 - timed out after 20000ms
# crm_mon on node01 before I kill the other VM:
Stack: corosync
Current DC: node02 (167938104) - partition with quorum
Version: 1.1.10-42f2063
2 Nodes configured
5 Resources configured
Online: [ node01 node02 ]
 Resource Group: PGServer
     fs_pg   (ocf::heartbeat:Filesystem):    Started node02
     lsb_pg  (lsb:postgresql):               Started node02
     ip_pg   (ocf::heartbeat:IPaddr2):       Started node02
 Master/Slave Set: ms_drbd_pg [drbd_pg]
     Masters: [ node02 ]
     Slaves: [ node01 ]
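For the failure test itself I hard-stop the peer VM from its hypervisor,
roughly like this (virsh and the domain name node02 are just an example of
how the hard kill is done):

  # immediate power-off of the peer VM, no clean shutdown
  virsh destroy node02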
Thank you,
Kiam