[Pacemaker] server lockup failures
Bernd Schubert
bs_lists at aakef.fastmail.fm
Wed Oct 28 11:05:24 UTC 2009
Hello,
I think there is a severe server failure that Pacemaker does not detect.
Overnight, a Lustre server failed in shrink_icache_memory(), probably while
holding dcache_lock. This is a global filesystem lock, and when a filesystem
fails while holding it, any I/O on that system simply hangs. I think Pacemaker
does not detect this failure. The failed node was the DC, and of course I
could no longer log in, although ping still worked. On the other server,
crm_mon showed one failed resource (monitor), but Pacemaker simply did not do
anything about it.
This is with Pacemaker 1.0.4.
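To make the failure mode concrete: the node still answers ping, but any I/O
blocks. A minimal sketch of a probe that tells the two apart could look like
the following. This is not Pacemaker code; the function name io_probe and its
parameters are made up for illustration. It does a small write plus fsync in a
child process and gives up after a timeout. Note that on a node hung on a
global lock, even starting the child may block, so something like this would
probably still need a watchdog or fencing from the peer behind it.

```python
import subprocess
import sys
import time


def io_probe(directory, timeout=10.0):
    """Return True if a small write + fsync in `directory` finishes in time.

    Hypothetical helper for illustration only; not part of Pacemaker.
    """
    # The I/O runs in a child process so that a write stuck in the kernel
    # does not block the loop below that enforces the deadline.
    child_code = (
        "import os, sys, tempfile\n"
        "fd, path = tempfile.mkstemp(dir=sys.argv[1])\n"
        "os.write(fd, b'probe')\n"
        "os.fsync(fd)\n"
        "os.close(fd)\n"
        "os.unlink(path)\n"
    )
    proc = subprocess.Popen(
        [sys.executable, "-c", child_code, directory],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            # Child finished: exit status 0 means the I/O completed.
            return rc == 0
        time.sleep(0.1)
    # Timed out. kill() is best effort; a process stuck in uninterruptible
    # sleep will not go away, so we deliberately do not wait for it.
    proc.kill()
    return False
```

A monitor action could call io_probe() on the filesystem in question and
report a failure when it returns False, instead of relying on network
reachability alone.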
I think I should be able to reproduce this rather quickly by adding a bogus
dcache_lock acquisition to Lustre. The question now is: how can we fix this in
Pacemaker?
Thanks,
Bernd
--
Bernd Schubert
DataDirect Networks