[ClusterLabs] Antw: Re: Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Fri Sep 1 02:58:16 EDT 2017
Hi!
I don't know the answer, but I wonder what would happen if corosync ran at
normal scheduling priority. My suspicion is that something's wrong, and using
the highest real-time priority could be the wrong fix for that problem ;-)
Personally I think a process that does disk I/O and waits for network
input cannot be the highest-priority real-time job. (A suitable candidate would
be a process that has its memory locked and does shared-memory communication
without any I/O)...
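
For anyone who wants to check which scheduling class a process actually runs
in before experimenting, here is a minimal, Linux-only sketch using Python's
standard os module. The helper name describe_scheduling is made up for
illustration; pass the PID of the corosync process to inspect it (0 means the
calling process).

```python
import os

# Readable names for the common Linux scheduling policies.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER (normal)",
    os.SCHED_FIFO: "SCHED_FIFO (real-time)",
    os.SCHED_RR: "SCHED_RR (real-time)",
}

def describe_scheduling(pid=0):
    """Return (policy name, priority) for the given PID (0 = this process)."""
    policy = os.sched_getscheduler(pid)
    priority = os.sched_getparam(pid).sched_priority
    # Fall back to the raw number for policies not listed above.
    name = POLICY_NAMES.get(policy, "policy %d" % policy)
    return name, priority

if __name__ == "__main__":
    print(describe_scheduling(0))
```

Comparing the result before and after changing the priority at least confirms
that the change took effect.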
Sorry for this off-topic thought.
Regards,
Ulrich
>>> Ferenc Wágner <wferi at niif.hu> wrote on 01.09.2017 at 00:40 in message
<87inh38ip3.fsf at lant.ki.iif.hu>:
> Jan Friesse <jfriesse at redhat.com> writes:
>
>> wferi at niif.hu writes:
>>
>>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>>> (in August; in May, it happened 0-2 times a day only, it's slowly
>>> ramping up):
>>>
>>> vhbl08 corosync[3687]: [TOTEM ] A processor failed, forming new configuration.
>>> vhbl03 corosync[3890]: [TOTEM ] A processor failed, forming new configuration.
>>> vhbl07 corosync[3805]: [MAIN ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>>
>> ^^^ This is the main problem you have to solve. It usually means that
>> the machine is too overloaded. [...]
>
> Before I start tracing the scheduler, I'd like to ask something: what
> wakes up the Corosync main process periodically? The token making a
> full circle? (Please forgive my simplistic understanding of the TOTEM
> protocol.) That would explain the recommendation in the log message,
> but does not fit well with the overload assumption: totally idle nodes
> could just as easily produce such warnings if there are no other regular
> wakeup sources. (I'm looking at timer_function_scheduler_timeout but I
> know too little of libqb to decide.)
>
>> As a start you can try what the message says: consider a token timeout
>> increase. Currently you have 3 seconds; in theory 6 seconds should be
>> enough.
>
> It was probably high time I realized that token timeout is scaled
> automatically when one has a nodelist. When you say Corosync should
> work OK with default settings up to 16 nodes, you assume this scaling is
> in effect, don't you? On the other hand, I've got no nodelist in the
> config, but token = 3000, which is less than the default 1000+4*650 with
> six nodes, and this will get worse as the cluster grows.
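
The scaling arithmetic mentioned above can be sketched as follows. This is
only an illustration, assuming the formula token + (nodes - 2) *
token_coefficient with the defaults 1000 and 650 quoted in the message, and
assuming that scaling applies only with a nodelist of at least three nodes.
The function name scaled_token_timeout is hypothetical, not part of Corosync.

```python
def scaled_token_timeout(nodes, token=1000, token_coefficient=650):
    """Effective token timeout in ms, assuming nodelist scaling is active."""
    if nodes < 3:
        # Assumption: no scaling is applied below three nodes.
        return token
    return token + (nodes - 2) * token_coefficient

if __name__ == "__main__":
    configured = 3000  # the fixed "token = 3000" from the message
    for nodes in (6, 16):
        scaled = scaled_token_timeout(nodes)
        print(nodes, "nodes:", scaled, "ms scaled vs", configured, "ms configured")
```

With six nodes this gives 1000 + 4 * 650 = 3600 ms, which is already above
the fixed 3000 ms, and the gap widens with every node added.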
>
> Comments on the above ramblings welcome!
>
> I'm grateful for all the valuable input poured into this thread by all
> parties: it's proven really educative in quite unexpected ways beyond
> what I was able to ask in the beginning.
> --
> Thanks,
> Feri