[Pacemaker] TOTEM: Process pause detected? Leading to STONITH...
Steven Dake
sdake at redhat.com
Fri Aug 12 00:11:12 UTC 2011
On 08/11/2011 03:05 AM, Sebastian Kaps wrote:
> Hi,
>
> On 04.08.2011, at 18:21, Steven Dake wrote:
>
>>> Jul 31 03:51:02 node01 corosync[5870]: [TOTEM ] Process pause detected
>>> for 11149 ms, flushing membership messages.
>>
>> This process pause message indicates the scheduler did not schedule
>> corosync for 11 seconds, which is greater than the failure detection
>> timeouts. What does your config file look like? What load are you running?
>
>
> We've had another one of these this morning:
> "Process pause detected for 11763 ms, flushing membership messages."
> According to the graphs generated from Nagios data, the load on that system
> jumped from 1.0 to 5.1 about 2 minutes before this event, stayed at that value for
> ~5 minutes, then dropped below 1 afterwards. 10 minutes later the system got shot,
Did Nagios possibly block for 10+ seconds during this time as well? If
so, it wouldn't have detected any spikes or delays in scheduling.
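
For reference, that pause message is conceptually just a monotonic-clock
check between passes of the event loop, and the same kind of check could
tell you whether Nagios itself stalled. A minimal sketch (illustrative
only, not corosync's actual code; the 10000 ms threshold stands in for
the token timeout):

/* Illustrative scheduling-pause check, not corosync's actual code.
 * Record a CLOCK_MONOTONIC timestamp on every pass through the loop;
 * if the gap between two passes exceeds the failure-detection timeout,
 * the process itself was not scheduled for that long. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

static uint64_t monotonic_ms(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

int main(void)
{
        const uint64_t pause_threshold_ms = 10000;      /* stands in for the token timeout */
        uint64_t last = monotonic_ms();

        for (;;) {
                uint64_t now;

                usleep(100000);         /* stands in for one event-loop pass */
                now = monotonic_ms();
                if (now - last > pause_threshold_ms)
                        printf("pause detected for %llu ms\n",
                               (unsigned long long)(now - last));
                last = now;
        }
        return 0;
}

Compile with "gcc -o pausecheck pausecheck.c" (add -lrt on older glibc)
and run it alongside the workload; any printed gap means that process
was starved for at least that long, independent of what Nagios saw.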
Are you running in a virtual machine or on old/slow hardware?
Re: the deadline CPU scheduler, the only thing I can find on that topic
is a new scheduling class. Corosync doesn't take advantage of that
scheduling class (it's not in the Linux 3.0 glibc man pages; if it is
there, we don't know how to use it).
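
For comparison, the realtime classes that the glibc man pages do document
(SCHED_FIFO and SCHED_RR) can be requested through sched_setscheduler().
The sketch below only illustrates that documented API; it is not a claim
about what corosync itself does or should do:

/* Hedged example: ask for round-robin realtime scheduling for the
 * calling process.  Needs CAP_SYS_NICE (e.g. run as root). */
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main(void)
{
        struct sched_param sp;

        memset(&sp, 0, sizeof(sp));
        sp.sched_priority = 2;  /* modest RT priority; valid range is
                                   sched_get_priority_min/max(SCHED_RR) */

        if (sched_setscheduler(0 /* self */, SCHED_RR, &sp) != 0) {
                fprintf(stderr, "sched_setscheduler: %s\n", strerror(errno));
                return 1;
        }

        printf("running with SCHED_RR priority %d\n", sp.sched_priority);
        return 0;
}

Keep in mind that a runaway process at realtime priority can starve the
rest of the system, so this is a mitigation for scheduling delays, not a
fix for whatever is stalling the box in the first place.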
> probably because OCFS2 got confused by the node leaving the cluster.
> At that time, the machine was only the standby node. The only things that could
> have been running then are a daily backup run (TSM), which starts the night before
> and takes a few hours to complete, and the OCFS2-related processes (the backup of
> the OCFS2 filesystem is done on that machine).
>
I would really like someone who has these process pause problems to
test a patch I have posted, to see if it rectifies the situation. Our
sizable QE team at Red Hat doesn't see these problems, and I can't
reproduce them in engineering. It is possible your device drivers are
holding spinlocks for extended periods, or some other kernel problem is
occurring.
If you feel up to the task of building your own corosync, try out this
patch:
http://marc.info/?l=openais&m=130989380207300&w=2
Regards
-steve
> What can I do to investigate this behavior? We had switched to the "deadline" CPU
> scheduler before the July 31st event. Could this cause this kind of behavior?
> I was under the impression that 'deadline' was designed to prevent exactly these
> kinds of situations.
> Further increasing the timeout above its current value of 10s doesn't look like
> the solution to this problem.
>
> The configuration is unchanged from the one I posted on August 4th.
> The funny thing is that the cluster has not shown any problems since July 31st.
>
> Thanks in advance!
>