[Pacemaker] Corosync and Pacemaker Hangs
Norbert Kiam Maclang
norbert.kiam.maclang at gmail.com
Fri Sep 12 02:00:21 UTC 2014
Hi,

After adding resource-level fencing on DRBD, I still ended up having
problems with timeouts on the drbd resource. Are there recommended
settings for this? I followed what is written in the DRBD documentation -
http://www.drbd.org/users-guide-emb/s-pacemaker-crm-drbd-backed-service.html

Another thing I can't understand is why failover works during the initial
tests, even if I reboot the vms several times, but after I let the cluster
soak for a couple of hours (say 8 hours or more) and continue the tests,
it will not fail over and ends up in split brain. I confirmed that
everything was healthy before performing the reboot: disk health and
network are good, DRBD is in sync, and time between the servers is in sync.
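For reference, when the nodes do split brain I currently recover by hand,
roughly as follows (a sketch based on the DRBD 8.4 users guide, assuming
node02 is the split-brain victim whose changes get discarded):

# On node02, the victim:
drbdadm secondary pg
drbdadm connect --discard-my-data pg
# On node01, the survivor (only needed if it is StandAlone):
drbdadm connect pg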
# Logs:
node01 lrmd[1036]: warning: child_timeout_callback: drbd_pg_monitor_29000
process (PID 27744) timed out
node01 lrmd[1036]: warning: operation_finished:
drbd_pg_monitor_29000:27744 - timed out after 20000ms
node01 crmd[1039]: error: process_lrm_event: LRM operation
drbd_pg_monitor_29000 (69) Timed Out (timeout=20000ms)
node01 crmd[1039]: warning: update_failcount: Updating failcount for
drbd_pg on tyo1mqdb01p after failed monitor: rc=1 (update=value++,
time=1410486352)
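The 20000ms in these messages is the monitor operation timeout (Pacemaker's
20s default, since I set none explicitly). One thing I am considering is
raising it on the drbd primitive; a minimal crmsh sketch, where the 60s
value is only my guess and not something from the DRBD docs:

primitive drbd_pg ocf:linbit:drbd \
  params drbd_resource="pg" \
  op monitor interval="29s" role="Master" timeout="60s" \
  op monitor interval="31s" role="Slave" timeout="60s"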
Thanks,
Kiam
On Thu, Sep 11, 2014 at 6:58 PM, Norbert Kiam Maclang <
norbert.kiam.maclang at gmail.com> wrote:
> Thank you Vladislav.
>
> I have configured resource-level fencing on drbd and removed wfc-timeout
> and degr-wfc-timeout (is this required?). My drbd configuration is now:
>
> resource pg {
>   device /dev/drbd0;
>   disk /dev/vdb;
>   meta-disk internal;
>   disk {
>     fencing resource-only;
>     on-io-error detach;
>     resync-rate 40M;
>   }
>   handlers {
>     fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
>     after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
>     split-brain "/usr/lib/drbd/notify-split-brain.sh nkbm";
>   }
>   on node01 {
>     address 10.2.136.52:7789;
>   }
>   on node02 {
>     address 10.2.136.55:7789;
>   }
>   net {
>     verify-alg md5;
>     after-sb-0pri discard-zero-changes;
>     after-sb-1pri discard-secondary;
>     after-sb-2pri disconnect;
>   }
> }
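>
> If I understand the fence-peer handler correctly, on replication link
> failure crm-fence-peer.sh should insert a temporary location constraint
> into the CIB, roughly like this sketch (the constraint id is my guess at
> the script's naming scheme, with node01 as the surviving Primary):
>
> location drbd-fence-by-handler-pg-ms_drbd_pg ms_drbd_pg \
>   rule $role="Master" -inf: #uname ne node01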
>
> Failover works on my initial test (restarting both nodes alternately -
> this always works). I will let it soak for a couple of hours and then run
> the failover test again (which always failed on my previous setup).
>
> Thank you!
> Kiam
>
> On Thu, Sep 11, 2014 at 2:14 PM, Vladislav Bogdanov <bubble at hoster-ok.com>
> wrote:
>
>> 11.09.2014 05:57, Norbert Kiam Maclang wrote:
>> > Is this something to do with quorum? But I already set
>>
>> You'd need to configure fencing at the drbd resource level.
>>
>>
>> http://www.drbd.org/users-guide-emb/s-pacemaker-fencing.html#s-pacemaker-fencing-cib
>>
>>
>> >
>> > property no-quorum-policy="ignore" \
>> > expected-quorum-votes="1"
>> >
>> > Thanks in advance,
>> > Kiam
>> >
>> > On Thu, Sep 11, 2014 at 10:09 AM, Norbert Kiam Maclang
>> > <norbert.kiam.maclang at gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Please help me understand what is causing the problem. I have a
>> > 2-node cluster running on vms using KVM. Each vm (I am using Ubuntu
>> > 14.04) runs on a separate hypervisor on a separate machine. Everything
>> > works well during testing (I restarted the vms alternately), but
>> > after a day, when I kill the other node, corosync and pacemaker
>> > always hang on the surviving node. Date and time on the vms are in
>> > sync, I use unicast, tcpdump shows both nodes exchanging packets, and
>> > I confirmed that DRBD is healthy and crm_mon shows good status before
>> > I kill the other node. Below are my configurations and the versions I
>> > used:
>> >
>> > corosync 2.3.3-1ubuntu1
>> > crmsh 1.2.5+hg1034-1ubuntu3
>> > drbd8-utils 2:8.4.4-1ubuntu1
>> > libcorosync-common4 2.3.3-1ubuntu1
>> > libcrmcluster4 1.1.10+git20130802-1ubuntu2
>> > libcrmcommon3 1.1.10+git20130802-1ubuntu2
>> > libcrmservice1 1.1.10+git20130802-1ubuntu2
>> > pacemaker 1.1.10+git20130802-1ubuntu2
>> > pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2
>> > postgresql-9.3 9.3.5-0ubuntu0.14.04.1
>> >
>> > # /etc/corosync/corosync.conf:
>> > totem {
>> >   version: 2
>> >   token: 3000
>> >   token_retransmits_before_loss_const: 10
>> >   join: 60
>> >   consensus: 3600
>> >   vsftype: none
>> >   max_messages: 20
>> >   clear_node_high_bit: yes
>> >   secauth: off
>> >   threads: 0
>> >   rrp_mode: none
>> >   interface {
>> >     member {
>> >       memberaddr: 10.2.136.56
>> >     }
>> >     member {
>> >       memberaddr: 10.2.136.57
>> >     }
>> >     ringnumber: 0
>> >     bindnetaddr: 10.2.136.0
>> >     mcastport: 5405
>> >   }
>> >   transport: udpu
>> > }
>> > amf {
>> >   mode: disabled
>> > }
>> > quorum {
>> >   provider: corosync_votequorum
>> >   expected_votes: 1
>> > }
>> > aisexec {
>> >   user: root
>> >   group: root
>> > }
>> > logging {
>> >   fileline: off
>> >   to_stderr: yes
>> >   to_logfile: no
>> >   to_syslog: yes
>> >   syslog_facility: daemon
>> >   debug: off
>> >   timestamp: on
>> >   logger_subsys {
>> >     subsys: AMF
>> >     debug: off
>> >     tags: enter|leave|trace1|trace2|trace3|trace4|trace6
>> >   }
>> > }
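>> >
>> > (I am also wondering whether the quorum section should use votequorum's
>> > two_node option instead of forcing expected_votes to 1 - a sketch of
>> > what I mean, per the votequorum(5) man page:)
>> >
>> > quorum {
>> >   provider: corosync_votequorum
>> >   two_node: 1
>> > }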
>> >
>> > # /etc/corosync/service.d/pcmk:
>> > service {
>> >   name: pacemaker
>> >   ver: 1
>> > }
>> >
>> > # /etc/drbd.d/global_common.conf:
>> > global {
>> >   usage-count no;
>> > }
>> >
>> > common {
>> >   net {
>> >     protocol C;
>> >   }
>> > }
>> >
>> > # /etc/drbd.d/pg.res:
>> > resource pg {
>> >   device /dev/drbd0;
>> >   disk /dev/vdb;
>> >   meta-disk internal;
>> >   startup {
>> >     wfc-timeout 15;
>> >     degr-wfc-timeout 60;
>> >   }
>> >   disk {
>> >     on-io-error detach;
>> >     resync-rate 40M;
>> >   }
>> >   on node01 {
>> >     address 10.2.136.56:7789;
>> >   }
>> >   on node02 {
>> >     address 10.2.136.57:7789;
>> >   }
>> >   net {
>> >     verify-alg md5;
>> >     after-sb-0pri discard-zero-changes;
>> >     after-sb-1pri discard-secondary;
>> >     after-sb-2pri disconnect;
>> >   }
>> > }
>> >
>> > # Pacemaker configuration:
>> > node $id="167938104" node01
>> > node $id="167938105" node02
>> > primitive drbd_pg ocf:linbit:drbd \
>> > params drbd_resource="pg" \
>> > op monitor interval="29s" role="Master" \
>> > op monitor interval="31s" role="Slave"
>> > primitive fs_pg ocf:heartbeat:Filesystem \
>> > params device="/dev/drbd0" directory="/var/lib/postgresql/9.3/main"
>> > fstype="ext4"
>> > primitive ip_pg ocf:heartbeat:IPaddr2 \
>> > params ip="10.2.136.59" cidr_netmask="24" nic="eth0"
>> > primitive lsb_pg lsb:postgresql
>> > group PGServer fs_pg lsb_pg ip_pg
>> > ms ms_drbd_pg drbd_pg \
>> > meta master-max="1" master-node-max="1" clone-max="2"
>> > clone-node-max="1" notify="true"
>> > colocation pg_on_drbd inf: PGServer ms_drbd_pg:Master
>> > order pg_after_drbd inf: ms_drbd_pg:promote PGServer:start
>> > property $id="cib-bootstrap-options" \
>> > dc-version="1.1.10-42f2063" \
>> > cluster-infrastructure="corosync" \
>> > stonith-enabled="false" \
>> > no-quorum-policy="ignore"
>> > rsc_defaults $id="rsc-options" \
>> > resource-stickiness="100"
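>> >
>> > (I know stonith-enabled="false" is not ideal for KVM guests; if I add
>> > node-level fencing later, I guess it would look roughly like this
>> > fence_virsh sketch - the hypervisor addresses and credentials below
>> > are placeholders:)
>> >
>> > primitive st_node01 stonith:fence_virsh \
>> >   params ipaddr="hv01" login="root" passwd="secret" port="node01" \
>> >   op monitor interval="60s"
>> > primitive st_node02 stonith:fence_virsh \
>> >   params ipaddr="hv02" login="root" passwd="secret" port="node02" \
>> >   op monitor interval="60s"
>> > location l_st_node01 st_node01 -inf: node01
>> > location l_st_node02 st_node02 -inf: node02
>> > property stonith-enabled="true"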
>> >
>> > # Logs on node01
>> > Sep 10 10:25:33 node01 crmd[1019]: notice: peer_update_callback:
>> > Our peer on the DC is dead
>> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition:
>> > State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION
>> > cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
>> > Sep 10 10:25:33 node01 crmd[1019]: notice: do_state_transition:
>> > State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
>> > cause=C_FSA_INTERNAL origin=do_election_check ]
>> > Sep 10 10:25:33 node01 corosync[940]: [TOTEM ] A new membership
>> > (10.2.136.56:52) was formed. Members left: 167938105
>> > Sep 10 10:25:45 node01 kernel: [74452.740024] d-con pg: PingAck did
>> > not arrive in time.
>> > Sep 10 10:25:45 node01 kernel: [74452.740169] d-con pg: peer(
>> > Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
>> > UpToDate -> DUnknown )
>> > Sep 10 10:25:45 node01 kernel: [74452.740987] d-con pg: asender
>> > terminated
>> > Sep 10 10:25:45 node01 kernel: [74452.740999] d-con pg: Terminating
>> > drbd_a_pg
>> > Sep 10 10:25:45 node01 kernel: [74452.741235] d-con pg: Connection
>> > closed
>> > Sep 10 10:25:45 node01 kernel: [74452.741259] d-con pg: conn(
>> > NetworkFailure -> Unconnected )
>> > Sep 10 10:25:45 node01 kernel: [74452.741260] d-con pg: receiver
>> > terminated
>> > Sep 10 10:25:45 node01 kernel: [74452.741261] d-con pg: Restarting
>> > receiver thread
>> > Sep 10 10:25:45 node01 kernel: [74452.741262] d-con pg: receiver
>> > (re)started
>> > Sep 10 10:25:45 node01 kernel: [74452.741269] d-con pg: conn(
>> > Unconnected -> WFConnection )
>> > Sep 10 10:26:12 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 8445) timed out
>> > Sep 10 10:26:12 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:8445 - timed out after 20000ms
>> > Sep 10 10:26:12 node01 crmd[1019]: error: process_lrm_event: LRM
>> > operation drbd_pg_monitor_31000 (30) Timed Out (timeout=20000ms)
>> > Sep 10 10:26:32 node01 crmd[1019]: warning: cib_rsc_callback:
>> > Resource update 23 failed: (rc=-62) Timer expired
>> > Sep 10 10:27:03 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 8693) timed out
>> > Sep 10 10:27:03 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:8693 - timed out after 20000ms
>> > Sep 10 10:27:54 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 8938) timed out
>> > Sep 10 10:27:54 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:8938 - timed out after 20000ms
>> > Sep 10 10:28:33 node01 crmd[1019]: error: crm_timer_popped:
>> > Integration Timer (I_INTEGRATED) just popped in state S_INTEGRATION!
>> > (180000ms)
>> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition:
>> > Progressed to state S_FINALIZE_JOIN after C_TIMER_POPPED
>> > Sep 10 10:28:33 node01 crmd[1019]: warning: do_state_transition: 1
>> > cluster nodes failed to respond to the join offer.
>> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log:
>> > join-1: node02=none
>> > Sep 10 10:28:33 node01 crmd[1019]: notice: crmd_join_phase_log:
>> > join-1: node01=welcomed
>> > Sep 10 10:28:45 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 9185) timed out
>> > Sep 10 10:28:45 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:9185 - timed out after 20000ms
>> > Sep 10 10:29:36 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 9432) timed out
>> > Sep 10 10:29:36 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:9432 - timed out after 20000ms
>> > Sep 10 10:30:27 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 9680) timed out
>> > Sep 10 10:30:27 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:9680 - timed out after 20000ms
>> > Sep 10 10:31:18 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 9927) timed out
>> > Sep 10 10:31:18 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:9927 - timed out after 20000ms
>> > Sep 10 10:32:09 node01 lrmd[1016]: warning: child_timeout_callback:
>> > drbd_pg_monitor_31000 process (PID 10174) timed out
>> > Sep 10 10:32:09 node01 lrmd[1016]: warning: operation_finished:
>> > drbd_pg_monitor_31000:10174 - timed out after 20000ms
>> >
>> > # crm_mon on node01 before I kill the other vm:
>> > Stack: corosync
>> > Current DC: node02 (167938104) - partition with quorum
>> > Version: 1.1.10-42f2063
>> > 2 Nodes configured
>> > 5 Resources configured
>> >
>> > Online: [ node01 node02 ]
>> >
>> > Resource Group: PGServer
>> > fs_pg (ocf::heartbeat:Filesystem): Started node02
>> > lsb_pg (lsb:postgresql): Started node02
>> > ip_pg (ocf::heartbeat:IPaddr2): Started node02
>> > Master/Slave Set: ms_drbd_pg [drbd_pg]
>> > Masters: [ node02 ]
>> > Slaves: [ node01 ]
>> >
>> > Thank you,
>> > Kiam