[ClusterLabs] Antw: Re: Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Fri Sep 1 02:58:16 EDT 2017

Hi!

I don't know the answer, but I wonder what would happen if corosync ran at
normal scheduling priority. My suspicion is that something's wrong, and using
the highest real-time priority could be the wrong fix for that problem ;-)

Personally I think a process that does disk I/O and is waiting for network
input cannot be the highest priority real-time job. (Such a candidate would be
a process that has its memory locked and is doing shared memory communication
without any I/O)...
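
To see which scheduling class a process is actually running under, something
like the following could be used. It is only a minimal sketch, assuming Linux
and Python 3; describe_scheduling is a hypothetical helper name, and pid 0
means "the calling process" (substitute the corosync PID on a cluster node).

```python
import os

# Map the Linux scheduling policy constants that this Python build exposes
# to readable names. These os.SCHED_* constants are only available on Linux.
POLICY_NAMES = {
    getattr(os, name): name
    for name in ("SCHED_OTHER", "SCHED_FIFO", "SCHED_RR", "SCHED_BATCH", "SCHED_IDLE")
    if hasattr(os, name)
}


def describe_scheduling(pid=0):
    """Return (policy_name, realtime_priority) for pid; 0 is the caller."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    # The kernel may OR extra flags into the policy value, so fall back
    # to a generic label when the value is not in the table.
    return POLICY_NAMES.get(policy, f"unknown({policy})"), priority


if __name__ == "__main__":
    name, prio = describe_scheduling(0)
    print(f"policy={name} priority={prio}")
```

A SCHED_RR or SCHED_FIFO result with a high priority would confirm the
real-time setup I am wondering about.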

Sorry for this off-topic thought.

Regards,
Ulrich

>>> Ferenc Wágner <wferi at niif.hu> wrote on 01.09.2017 at 00:40 in message
<87inh38ip3.fsf at lant.ki.iif.hu>:
> Jan Friesse <jfriesse at redhat.com> writes:
> 
>> wferi at niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is the main problem you have to solve. It usually means that
>> the machine is too overloaded. [...]
> 
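
For anyone who wants to tally these warnings across nodes, the pause and the
threshold can be pulled out of the log line with a small script. This is just
a sketch based on the single message quoted above; parse_pause and PAUSE_RE
are made-up names, and the pattern may need adjusting for other Corosync
versions or log formats.

```python
import re

# Matches the "not scheduled" warning as it appears in the quoted log line.
PAUSE_RE = re.compile(
    r"not scheduled for (?P<paused>[\d.]+) ms "
    r"\(threshold is (?P<threshold>[\d.]+) ms\)"
)


def parse_pause(line):
    """Return (paused_ms, threshold_ms) or None if the line does not match."""
    match = PAUSE_RE.search(line)
    if match is None:
        return None
    return float(match.group("paused")), float(match.group("threshold"))


line = (
    "vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled "
    "for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase."
)

paused, threshold = parse_pause(line)
print(f"paused {paused:.1f} ms, {paused - threshold:.1f} ms over the threshold")
```

Note that the reported threshold of 2400 ms is 80% of the 3000 ms token
timeout mentioned in this thread.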
> Before I start tracing the scheduler, I'd like to ask something: what
> wakes up the Corosync main process periodically?  The token making a
> full circle?  (Please forgive my simplistic understanding of the TOTEM
> protocol.)  That would explain the recommendation in the log message,
> but does not fit well with the overload assumption: totally idle nodes
> could just as easily produce such warnings if there are no other regular
> wakeup sources.  (I'm looking at timer_function_scheduler_timeout but I
> know too little of libqb to decide.)
> 
>> As a start you can try what the message says: consider a token timeout
>> increase. Currently you have 3 seconds; in theory 6 seconds should be
>> enough.
> 
> It was probably high time I realized that token timeout is scaled
> automatically when one has a nodelist.  When you say Corosync should
> work OK with default settings up to 16 nodes, you assume this scaling is
> in effect, don't you?  On the other hand, I've got no nodelist in the
> config, but token = 3000, which is less than the default 1000+4*650 with
> six nodes, and this will get worse as the cluster grows.
> 
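
The arithmetic in the paragraph above can be sketched as follows. It assumes
the scaling rule token + (nodes - 2) * token_coefficient, applied only when a
nodelist with at least three nodes is present, which matches the 1000+4*650
figure quoted for six nodes. The function and parameter names are invented
for illustration and are not Corosync API.

```python
def effective_token_timeout(token_ms=1000, node_count=0, token_coefficient_ms=650):
    """Token timeout in ms after scaling by the number of nodelist entries.

    node_count is the number of nodes in the nodelist; pass 0 when the
    configuration has no nodelist, in which case no scaling is applied.
    """
    if node_count < 3:
        return token_ms
    return token_ms + (node_count - 2) * token_coefficient_ms


# Six nodes with a nodelist and the defaults assumed above.
scaled = effective_token_timeout(node_count=6)   # 1000 + 4 * 650 = 3600

# No nodelist, token = 3000 set explicitly: the value is used as-is.
fixed = effective_token_timeout(token_ms=3000, node_count=0)

print(f"scaled default: {scaled} ms, fixed setting: {fixed} ms")
for nodes in (6, 8, 12, 16):
    print(nodes, "nodes ->", effective_token_timeout(node_count=nodes), "ms")
```

The loop shows how the gap between a fixed 3000 ms and the scaled value
widens as nodes are added.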
> Comments on the above ramblings welcome!
> 
> I'm grateful for all the valuable input poured into this thread by all
> parties: it's proven really educative in quite unexpected ways beyond
> what I was able to ask in the beginning.
> -- 
> Thanks,
> Feri