[ClusterLabs] Is "Process pause detected" triggered too easily?

Wed Sep 27 09:06:13 EDT 2017

Jean,

> Hello,
>
> As the subject line suggests, I am wondering why I see so many of these
> log lines (many means about 10 times per minute, usually several in the
> same second):
>
> Sep 26 19:56:24 [950] vm0 corosync notice  [TOTEM ] Process pause detected for 2555 ms, flushing membership messages.
> Sep 26 19:56:24 [950] vm0 corosync notice  [TOTEM ] Process pause detected for 2558 ms, flushing membership messages.
>
> Let me add some context:
> - this is observed in 3 small VMs on my laptop
> - the OS is CentOS 7.3, corosync is 2.4.0-9.el7_4.2
> - these VMs only run corosync, nothing else
> - the VM host (my laptop) is idle 60-80% of the time
> - VMs are qemu-kvm guests, connected with tap interfaces
> - AND the messages only appear when, on one of the VMs, I do stop/start
> corosync in a tight loop, like this:
>
> [root@vm2 ~]# while :; do echo $(date) stop; systemctl stop corosync;
> echo $(date) start; systemctl start corosync; done
> Tue Sep 26 19:50:19 CEST 2017 stop
> Tue Sep 26 19:50:21 CEST 2017 start
> Tue Sep 26 19:50:21 CEST 2017 stop
> Tue Sep 26 19:50:22 CEST 2017 start
> ...
>
> I understand that this kind of test is stressful (and quite artificial), but
> I'm still surprised to see these particular messages, because it seems to
> me a bit unlikely that the corosync process is not properly scheduled for
> seconds at a time so frequently (several times per minute).

I don't think scheduling is the cause. If the scheduler were the cause, 
a different message (Corosync main process was not scheduled for ...) 
would kick in. This looks more like something is blocked in totemsrp.
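
A rough way to check which of the two cases a given log shows is to count
both message types side by side. The snippet below is only a sketch: the
function name is made up for illustration, the patterns are just the
message prefixes quoted in this thread, and the log path in the usage
comment is an assumption that depends on your logging configuration.

```sh
#!/bin/sh
# Read a corosync log on stdin and count two kinds of messages:
#   - "Process pause detected"  (the totem message quoted above)
#   - "Corosync main process was not scheduled for"  (the scheduling message)
# The function name is hypothetical; it is not part of corosync.
count_pause_kinds() {
    awk '
        /Process pause detected/ { pause++ }
        /Corosync main process was not scheduled for/ { sched++ }
        END {
            # Unset awk variables print as 0 with %d.
            printf "totem_pause=%d not_scheduled=%d\n", pause, sched
        }
    '
}

# Usage (log path is an assumption, adjust as needed):
#   count_pause_kinds < /var/log/cluster/corosync.log
```

Many "totem_pause" hits with no "not_scheduled" hits would be consistent
with the pause being detected inside totemsrp rather than the process not
getting CPU time, but it does not prove it.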

>
> So I wonder if maybe there could be other explanations?
>
> Also, it looks like the side effect is that corosync drops important
> messages (I think "join" messages?), and I fear that this can lead to

You mean membership join messages? Because there are a lot (327) of them 
in the log you've sent.

> bigger issues with DLM (which is why I'm looking into this in the first
> place).
>
> In case that's helpful, attached are 10 minutes of corosync log and the
> config file I'm using (it has 5 nodes declared, but I reproduce even with
> just 3 nodes).
>
> Thanks in advance for any suggestion!

I'll definitely try to reproduce this bug and let you know. I don't 
think any messages get lost, but it's better to be on the safe side.

Regards,
   Honza

>
>
> Cheers,
> JM