[Pacemaker] server lockup failures

Andrew Beekhof andrew at beekhof.net
Wed Oct 28 07:54:26 EDT 2009


On Wed, Oct 28, 2009 at 1:05 PM, Bernd Schubert
<bs_lists at aakef.fastmail.fm> wrote:
> Hello,
>
> I think there is a severe server failure that pacemaker doesn't detect.
> Overnight, a Lustre server failed in shrink_icache_memory(), probably while
> holding dcache_lock. This is a global filesystem lock, and when a filesystem
> fails while holding it, any IO on the system just hangs.

And the FS in question was /, so Pacemaker basically hung?

> And I think pacemaker
> doesn't detect this failure. The DC was the failed node and, of course, I
> couldn't log in anymore, but ping still worked. On the other server, crm_mon
> showed one failed resource (monitor), but it simply didn't do anything.
>
> This is with pacemaker 1.0.4.
>
> I think I should be able to reproduce this rather quickly by deliberately
> adding an incorrect dcache_lock acquisition to Lustre. The question now is:
> how can we fix this in pacemaker?
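
To illustrate the failure mode described above (ping still answers, but
anything touching the filesystem blocks), here is a rough sketch of a
liveness probe that does a filesystem operation under a timeout instead of
relying on network reachability. This is not Pacemaker code and not a
proposed fix; the names responds_within and fs_responsive are hypothetical,
and it assumes that an os.stat() on the affected path would block in the
hung state. Whether such a probe could itself keep running on a node in
that state is not established here.

```python
import os
import threading


def responds_within(probe, timeout):
    """Return True if probe() finishes without raising within `timeout` seconds.

    The probe runs in a daemon thread so that a call stuck in a blocked
    syscall cannot stall the caller or keep the interpreter from exiting.
    """
    result = {"ok": False}
    done = threading.Event()

    def worker():
        try:
            probe()
            result["ok"] = True
        except Exception:
            # Any error from the probe counts as "not healthy".
            result["ok"] = False
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()

    # Event.wait() returns False when the timeout expires first.
    if not done.wait(timeout):
        return False
    return result["ok"]


def fs_responsive(path, timeout=5.0):
    """Hypothetical check: does a stat() on `path` return in time?"""
    return responds_within(lambda: os.stat(path), timeout)


if __name__ == "__main__":
    print("filesystem responsive:", fs_responsive("/"))
```

The probe runs in a thread rather than a child process because starting a
new process itself needs the filesystem, so that step could block too.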
