[ClusterLabs] Is "Process pause detected" triggered too easily?

Tue Sep 26 14:41:38 EDT 2017

Hello,

As the subject line suggests, I am wondering why I see so many of these 
log lines (many means about 10 times per minute, usually several in the 
same second):

Sep 26 19:56:24 [950] vm0 corosync notice  [TOTEM ] Process pause detected for 2555 ms, flushing membership messages.
Sep 26 19:56:24 [950] vm0 corosync notice  [TOTEM ] Process pause detected for 2558 ms, flushing membership messages.
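
For reference, here is a rough sketch of how the rate could be measured from the log. It assumes each message sits on a single line in the log file and starts with a "Mon DD HH:MM:SS" timestamp, as in the excerpt above. The function name pauses_per_minute is made up for illustration and is not part of corosync.

```shell
# Count "Process pause detected" lines per minute and report the
# largest pause seen in each minute. Reads log lines on stdin.
# Assumption: one message per line, timestamp in fields 1-3.
pauses_per_minute() {
    awk '
        /Process pause detected/ {
            # Fields 1-3 look like "Sep 26 19:56:24"; keep month, day, HH:MM.
            minute = $1 " " $2 " " substr($3, 1, 5)
            if (!(minute in count)) order[++n] = minute
            count[minute]++
            # Pull the duration out of "detected for <N> ms".
            if (match($0, /detected for [0-9]+ ms/)) {
                ms = substr($0, RSTART + 13, RLENGTH - 16) + 0
                if (ms > max[minute]) max[minute] = ms
            }
        }
        END {
            # Print minutes in the order they first appeared in the log.
            for (i = 1; i <= n; i++)
                printf "%s  count=%d  max_ms=%d\n", order[i], count[order[i]], max[order[i]]
        }
    '
}
```

With the logfile path from my config this would be run as "pauses_per_minute < /var/log/corosync/corosync.log", or with "xzcat corosync.log.xz | pauses_per_minute" on the attached log.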

Let me add some context:
- this is observed in 3 small VMs on my laptop
- the OS is CentOS 7.3, corosync is 2.4.0-9.el7_4.2
- these VMs only run corosync, nothing else
- the VM host (my laptop) is idle 60-80% of the time
- VMs are qemu-kvm guests, connected with tap interfaces
- AND the messages only appear when I stop and start corosync in a tight 
loop on one of the VMs, like this:

[root@vm2 ~]# while :; do echo $(date) stop; systemctl stop corosync; echo $(date) start; systemctl start corosync; done
Tue Sep 26 19:50:19 CEST 2017 stop
Tue Sep 26 19:50:21 CEST 2017 start
Tue Sep 26 19:50:21 CEST 2017 stop
Tue Sep 26 19:50:22 CEST 2017 start
...

I understand that this kind of test is stressful (and quite artificial), but 
I'm still surprised to see these particular messages, because it seems to 
me a bit unlikely that the corosync process is not properly scheduled for 
seconds at a time so frequently (several times per minute).

So I wonder if maybe there could be other explanations?

Also, it looks like the side effect is that corosync drops important 
messages (I think "join" messages?), and I fear that this can lead to 
bigger issues with DLM (which is why I'm looking into this in the first 
place).

In case that's helpful, attached are 10 minutes of corosync log and the 
config file I'm using (it has 5 nodes declared, but I reproduce even with 
just 3 nodes).

Thanks in advance for any suggestion!

Cheers,
JM

-- 
saffroy at gmail.com
-------------- next part --------------
# Please read the corosync.conf.5 manual page

totem {
	config_version: 20170925231703
	version: 2

	transport: udpu

	# How long before declaring a token lost (ms)
	token: 3000

	# How many token retransmits before forming a new configuration
	token_retransmits_before_loss_const: 10

	# How long to wait for join messages in the membership protocol (ms)
	join: 100
	#send_join: 60

	# How long to wait for consensus to be achieved before starting a new round of membership configuration (ms)
	consensus: 3600

	# Turn off the virtual synchrony filter
	vsftype: none

	# Number of messages that may be sent by one processor on receipt of the token
	max_messages: 20

	# Limit generated nodeids to 31-bits (positive signed integers)
	clear_node_high_bit: yes

	# Disable encryption
 	secauth: off

	# How many threads to use for encryption/decryption
 	threads: 0

	# Optionally assign a fixed node id (integer)
	# nodeid: 1234

	# This specifies the mode of redundant ring, which may be none, active, or passive.
 	rrp_mode: none

 	interface {
		# The following values need to be set based on your environment 
		ringnumber: 0
		bindnetaddr: 172.16.0.33
		#broadcast: yes
		#mcastaddr: 226.94.1.1
		#mcastport: 5405
	}

	cluster_name: dlm
}

amf {
	mode: disabled
}

quorum {
	# Quorum for the Pacemaker Cluster Resource Manager
	provider: corosync_votequorum
	#expected_votes: 2
	quorum_votes: 0
	votes: 0
}

aisexec {
        user:   root
        group:  root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
	syslog_facility: daemon
        debug: on
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: on
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}

nodelist {
	node {
      	     	# vm0
		ring0_addr: 172.16.0.33
		quorum_votes: 1
		nodeid: 1
	}
	node {
      	     	# vm1
		ring0_addr: 172.16.1.33
		quorum_votes: 1
		nodeid: 2
	}
	node {
      	     	# vm2
		ring0_addr: 172.16.2.33
		quorum_votes: 1
		nodeid: 3
	}
	node {
      	     	# vm3
		ring0_addr: 172.16.3.33
		quorum_votes: 0
		nodeid: 4
	}
	node {
      	     	# vm4
		ring0_addr: 172.16.4.33
		quorum_votes: 0
		nodeid: 5
	}
}
-------------- next part --------------
A non-text attachment was scrubbed...
Name: corosync.log.xz
Type: application/x-xz
Size: 186708 bytes
Desc: 
URL: <http://lists.clusterlabs.org/pipermail/users/attachments/20170926/fad420aa/attachment-0002.xz>