[Pacemaker] Corosync and Pacemaker Hangs
Norbert Kiam Maclang
norbert.kiam.maclang at gmail.com
Thu Sep 11 10:58:47 UTC 2014
Thank you, Vladislav.
I have configured resource-level fencing on DRBD and removed wfc-timeout
and degr-wfc-timeout (are these required?). My DRBD configuration is now:
resource pg {
  device /dev/drbd0;
  disk /dev/vdb;
  meta-disk internal;
  disk {
    fencing resource-only;
    on-io-error detach;
    resync-rate 40M;
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
  }
  on node01 {
    address 10.2.136.52:7789;
  }
  on node02 {
    address 10.2.136.55:7789;
  }
  net {
    verify-alg md5;
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
}
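(Side note: if I understand the fence-peer mechanism correctly, when the
replication link breaks crm-fence-peer.sh should add a temporary -INFINITY
location constraint against the Master role, and crm-unfence-peer.sh should
remove it again after resync. The constraint id prefix below is only my
guess from the docs, so after the next failure I plan to check it with
something like:

  crm configure show | grep drbd-fence-by-handler

and confirm the constraint disappears once the peer is UpToDate again.)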
Failover works in my initial test (restarting both nodes alternately, which
has always worked). I will wait a couple of hours and then run the failover
test again (the one that always failed on my previous setup).
Thank you!
Kiam
On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov <bubble at hoster-ok.com>
wrote:
> 11.09.2014 05:57, Norbert Kiam Maclang wrote:
> > Does this have something to do with quorum? But I already set
>
> You'd need to configure fencing at the DRBD resource level.
>
>
> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>
>
> >
> > property no-quorum-policy="ignore" \
> > expected-quorum-votes="1"
> >
> > Thanks in advance,
> > Kiam
> >
> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
> > <norbert.kiam.maclang at gmail.com>
> > wrote:
> >
> > Hi,
> >
> > Please help me understand what is causing the problem. I have a
> > 2-node cluster running on VMs under KVM. Each VM (I am using Ubuntu
> > 14.04) runs on a separate hypervisor on a separate machine. Everything
> > works well during testing (I restarted the VMs alternately), but
> > after a day, when I kill the other node, corosync and pacemaker
> > always hang on the surviving node. Date and time on the VMs are in
> > sync, I use unicast, tcpdump shows traffic between both nodes, DRBD
> > is confirmed healthy, and crm_mon shows a good status before I
> > kill the other node. Below are my configurations and the versions I used:
> >
> > corosync 2.3.3-1ubuntu1
> > crmsh 1.2.5+hg1034-1ubuntu3
> > drbd8-utils 2:8.4.4-1ubuntu1
> > libcorosync-common4 2.3.3-1ubuntu1
> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
> > libcrmservice1 1.1.10+git20130802-1ubuntu2
> > pacemaker 1.1.10+git20130802-1ubuntu2
> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
> >
> > # /etc/corosync/corosync.conf:
> > totem {
> >   version: 2
> >   token: 3000
> >   token_retransmits_before_loss_const: 10
> >   join: 60
> >   consensus: 3600
> >   vsftype: none
> >   max_messages: 20
> >   clear_node_high_bit: yes
> >   secauth: off
> >   threads: 0
> >   rrp_mode: none
> >   interface {
> >     member {
> >       memberaddr: 10.2.136.56
> >     }
> >     member {
> >       memberaddr: 10.2.136.57
> >     }
> >     ringnumber: 0
> >     bindnetaddr: 10.2.136.0
> >     mcastport: 5405
> >   }
> >   transport: udpu
> > }
> > amf {
> >   mode: disabled
> > }
> > quorum {
> >   provider: corosync_votequorum
> >   expected_votes: 1
> > }
> > aisexec {
> >   user: root
> >   group: root
> > }
> > logging {
> >   fileline: off
> >   to_stderr: yes
> >   to_logfile: no
> >   to_syslog: yes
> >   syslog_facility: daemon
> >   debug: off
> >   timestamp: on
> >   logger_subsys {
> >     subsys: AMF
> >     debug: off
> >     tags: enter|leave|trace1|trace2|trace3|trace4|trace6
> >   }
> > }
> >
> > # /etc/corosync/service.d/pcmk:
> > service {
> >   name: pacemaker
> >   ver: 1
> > }
> >
> > # /etc/drbd.d/global_common.conf:
> > global {
> >   usage-count no;
> > }
> >
> > common {
> >   net {
> >     protocol C;
> >   }
> > }
> >
> > # /etc/drbd.d/pg.res:
> > resource pg {
> >   device /dev/drbd0;
> >   disk /dev/vdb;
> >   meta-disk internal;
> >   startup {
> >     wfc-timeout 15;
> >     degr-wfc-timeout 60;
> >   }
> >   disk {
> >     on-io-error detach;
> >     resync-rate 40M;
> >   }
> >   on node01 {
> >     address 10.2.136.56:7789;
> >   }
> >   on node02 {
> >     address 10.2.136.57:7789;
> >   }
> >   net {
> >     verify-alg md5;
> >     after-sb-0pri discard-zero-changes;
> >     after-sb-1pri discard-secondary;
> >     after-sb-2pri disconnect;
> >   }
> > }
> >
> > # Pacemaker configuration:
> > node $id="167938104" node01
> > node $id="167938105" node02
> > primitive drbd_pg ocf:linbit:drbd \
> >   params drbd_resource="pg" \
> >   op monitor interval="29s" role="Master" \
> >   op monitor interval="31s" role="Slave"
> > primitive fs_pg ocf:heartbeat:Filesystem \
> >   params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main" \
> >   fstype="ext4"
> > primitive ip_pg ocf:heartbeat:IPaddr2 \
> >   params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
> > primitive lsb_pg lsb:postgresql
> > group PGServer fs_pg lsb_pg ip_pg
> > ms ms_drbd_pg drbd_pg \
> >   meta master-max="1" master-node-max="1" clone-max="2" \
> >   clone-node-max="1" notify="true"
> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
> > property $id="cib-bootstrap-options" \
> >   dc-version="1.1.10-42f2063" \
> >   cluster-infrastructure="corosync" \
> >   stonith-enabled="false" \
> >   no-quorum-policy="ignore"
> > rsc_defaults $id="rsc-options" \
> >   resource-stickiness="100"
> >
> > # Logs on node01
> > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback:
> > Our peer on the DC is dead
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition:
> > State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
> > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition:
> > State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> > cause=C_FSA_INTERNAL origin=do_election_check ]
> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership
> > (10.2.136.56:52) was formed. Members left:
> > 167938105
> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did
> > not arrive in time.
> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer(
> > Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
> > UpToDate -> DUnknown )
> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender
> > terminated
> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating
> > drbd_a_pg
> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection
> > closed
> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(
> > NetworkFailure -> Unconnected )
> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver
> > terminated
> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting
> > receiver thread
> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver
> > (re)started
> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn(
> > Unconnected -> WFConnection )
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 8445) timed out
> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:8445 - timed out after 20000ms
> > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM
> > operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback:
> > Resource update 23 failed: (rc=-62) Timer expired
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 8693) timed out
> > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:8693 - timed out after 20000ms
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 8938) timed out
> > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:8938 - timed out after 20000ms
> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped:
> > Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION!
> > (180000ms)
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition:
> > Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1
> > cluster nodes failed to respond to the join offer.
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log:
> > join-1: node02=none
> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log:
> > join-1: node01=welcomed
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 9185) timed out
> > Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:9185 - timed out after 20000ms
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 9432) timed out
> > Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:9432 - timed out after 20000ms
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 9680) timed out
> > Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:9680 - timed out after 20000ms
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 9927) timed out
> > Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:9927 - timed out after 20000ms
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback:
> > drbd_pg_monitor_31000 process (PID 10174) timed out
> > Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished:
> > drbd_pg_monitor_31000:10174 - timed out after 20000ms
> >
> > # crm_mon on node01 before I kill the other VM:
> > Stack: corosync
> > Current DC: node02 (167938104) - partition with quorum
> > Version: 1.1.10-42f2063
> > 2 Nodes configured
> > 5 Resources configured
> >
> > Online: [ node01 node02 ]
> >
> > Resource Group: PGServer
> > fs_pg (ocf::heartbeat:Filesystem): Started node02
> > lsb_pg (lsb:postgresql): Started node02
> > ip_pg (ocf::heartbeat:IPaddr2): Started node02
> > Master/Slave Set: ms_drbd_pg [drbd_pg]
> > Masters: [ node02 ]
> > Slaves: [ node01 ]
> >
> > Thank you,
> > Kiam
> >
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>