[Pacemaker] server lockup failures

Fri Oct 30 10:25:28 UTC 2009

On Friday 30 October 2009, Lars Marowsky-Bree wrote:
> On 2009-10-29T09:58:13, Andrew Beekhof <andrew at beekhof.net> wrote:
> > > Heartbeat based, I still didn't have the time to look into openais.
> >
> > I guess heartbeat wasn't hung then... otherwise it would have stopped
> > sending "i'm here" packets (and dropped out of the membership list).
> 
> Both heartbeat and OpenAIS try quite hard not to touch the IO layers, to
> avoid being struck by IO latencies.
> 
> Probably not even crmd needs to touch the fs, so it would still send its
> DC keepalive packets and/or respond as the DC. Things like this need to
> be caught via resource agent monitoring.

I'm afraid it is not that simple. One of the resources was marked as failed in
crm_mon output, but pacemaker still didn't do anything to migrate the
resource. Manual attempts to stop resources also failed. Only after I invoked
stonith myself to reboot the failed server did the DC migrate and pacemaker
start working again. I hope I will have some time in the afternoon to start
debugging this.

Cheers,
Bernd