[ClusterLabs] Strange Corosync (TOTEM) logs, Pacemaker OK but DLM stuck

Ferenc Wágner wferi at niif.hu
Thu Aug 31 18:40:24 EDT 2017


Jan Friesse <jfriesse at redhat.com> writes:

> wferi at niif.hu writes:
>
>> In a 6-node cluster (vhbl03-08) the following happens 1-5 times a day
>> (in August; in May, it happened 0-2 times a day only, it's slowly
>> ramping up):
>>
>> vhbl08 corosync[3687]:   [TOTEM ] A processor failed, forming new configuration.
>> vhbl03 corosync[3890]:   [TOTEM ] A processor failed, forming new configuration.
>> vhbl07 corosync[3805]:   [MAIN  ] Corosync main process was not scheduled for 4317.0054 ms (threshold is 2400.0000 ms). Consider token timeout increase.
>
> ^^^ This is the main problem you have to solve. It usually means that
> the machine is too overloaded. [...]

Before I start tracing the scheduler, I'd like to ask something: what
wakes up the Corosync main process periodically?  The token making a
full circle?  (Please forgive my simplistic understanding of the TOTEM
protocol.)  That would explain the recommendation in the log message,
but does not fit well with the overload assumption: totally idle nodes
could just as easily produce such warnings if there are no other regular
wakeup sources.  (I'm looking at timer_function_scheduler_timeout but I
know too little of libqb to decide.)
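
To make my question concrete, here is a sketch of how I imagine such a
pause detector looks on top of libqb; the function name, the structure
and the constants (derived from my token = 3000 ms: a period of token/3
and a threshold of 0.8 * token, matching the 2400 ms in the log above)
are my guesses, not corosync's actual code:

    #include <stdint.h>
    #include <stdio.h>
    #include <qb/qbdefs.h>
    #include <qb/qbutil.h>
    #include <qb/qbloop.h>

    /* Guessed values mirroring token = 3000 ms: re-arm every token/3
     * and warn when the observed gap exceeds 0.8 * token. */
    #define PERIOD_NS    (1000ULL * QB_TIME_NS_IN_MSEC)
    #define THRESHOLD_NS (2400ULL * QB_TIME_NS_IN_MSEC)

    static qb_loop_t *loop;
    static qb_loop_timer_handle handle;
    static uint64_t last_ns;

    static void scheduler_timeout(void *data)
    {
        uint64_t now_ns = qb_util_nano_current_get();
        uint64_t gap_ns = now_ns - last_ns;

        if (gap_ns > THRESHOLD_NS)
            fprintf(stderr, "not scheduled for %0.4f ms "
                    "(threshold is %0.4f ms)\n",
                    (double)gap_ns / QB_TIME_NS_IN_MSEC,
                    (double)THRESHOLD_NS / QB_TIME_NS_IN_MSEC);

        last_ns = now_ns;
        /* The periodic wakeup source is this timer re-arming itself,
         * independent of the token making its circle. */
        qb_loop_timer_add(loop, QB_LOOP_MED, PERIOD_NS,
                          NULL, scheduler_timeout, &handle);
    }

    int main(void)
    {
        loop = qb_loop_create();
        last_ns = qb_util_nano_current_get();
        qb_loop_timer_add(loop, QB_LOOP_MED, PERIOD_NS,
                          NULL, scheduler_timeout, &handle);
        qb_loop_run(loop);   /* SIGSTOP/SIGCONT the process to trigger it */
        return 0;
    }

If this is roughly right, then even a totally idle node would log the
warning whenever the scheduler (or the hypervisor) withholds the CPU
long enough, regardless of cluster traffic.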

> As a start, you can try what the message says: consider a token
> timeout increase. Currently you have 3 seconds; in theory, 6 seconds
> should be enough.

It was probably high time I realized that the token timeout is scaled
automatically when one has a nodelist.  When you say Corosync should
work OK with default settings up to 16 nodes, you assume this scaling
is in effect, don't you?  On the other hand, I've got no nodelist in my
config, just token = 3000, which is less than the 1000 + 4*650 = 3600 ms
the defaults would yield with six nodes, and the gap will only widen as
the cluster grows.
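
For concreteness, here is the comparison as I understand it (assuming
the documented defaults of token = 1000 ms and token_coefficient =
650 ms; whether this is really how the scaling works is exactly what
I'd like confirmed):

    totem {
        version: 2
        # my current setup: fixed timeout, no automatic scaling
        token: 3000
    }

versus

    totem {
        version: 2
        # defaults left in place: the effective token becomes
        # 1000 + (6 - 2) * 650 = 3600 ms with the six nodes below,
        # growing by another 650 ms for each node added
    }

    nodelist {
        node {
            ring0_addr: vhbl03
        }
        node {
            ring0_addr: vhbl04
        }
        node {
            ring0_addr: vhbl05
        }
        node {
            ring0_addr: vhbl06
        }
        node {
            ring0_addr: vhbl07
        }
        node {
            ring0_addr: vhbl08
        }
    }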

Comments on the above ramblings are welcome!

I'm grateful for all the valuable input poured into this thread by all
parties: it has proven really educational in quite unexpected ways,
going beyond what I was able to ask at the beginning.
-- 
Thanks,
Feri


