[Pacemaker] server lockup failures
Bernd Schubert
bernd.schubert at fastmail.fm
Fri Oct 30 10:25:28 UTC 2009
On Friday 30 October 2009, Lars Marowsky-Bree wrote:
> On 2009-10-29T09:58:13, Andrew Beekhof <andrew at beekhof.net> wrote:
> > > Heartbeat based, I still didn't have the time to look into openais.
> >
> > I guess heartbeat wasn't hung then... otherwise it would have stopped
> > sending "i'm here" packets (and dropped out of the membership list).
>
> Both heartbeat and OpenAIS try hard not to touch the IO layers, to
> avoid being struck by IO latencies.
>
> Probably not even crmd needs to touch the fs, so it would still send its
> DC keepalive packets and/or respond as the DC. Things like this need to
> be caught via resource agent monitoring.
I'm afraid it is not that simple. One of the resources was marked as failed in
the crm_mon output, but pacemaker still did nothing to migrate the
resource. Manual attempts to stop resources also failed. Only after I invoked
stonith myself to reboot the failed server did the DC also migrate and
pacemaker start working again. I hope to have some time this afternoon to
start debugging this.
Cheers,
Bernd