[Pacemaker] server lockup failures

Bernd Schubert bs_lists at aakef.fastmail.fm
Wed Oct 28 07:05:24 EDT 2009


Hello,

I think there is a severe server failure that pacemaker doesn't detect. Overnight
a Lustre server failed in shrink_icache_memory(), probably while holding
dcache_lock. This is a global filesystem lock, and when a filesystem fails
while holding it, any IO on that system just hangs. I think pacemaker
doesn't detect this failure. The failed node was the DC, and of course I
couldn't log in anymore, but ping still worked. On the other server, crm_mon
showed one failed resource (monitor), but the cluster simply didn't do anything.

This is with pacemaker 1.0.4.

I think I should be able to reproduce this rather quickly by deliberately adding
an incorrect dcache_lock call to Lustre. The question now is: how can we fix
this in pacemaker?
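
To illustrate the kind of check that would notice this state (ping answers,
but any IO hangs), here is a minimal sketch in Python. It is not part of
pacemaker or Lustre. The function name io_probe, the probe file name and the
timeout values are all made up for this example. It runs a small write+fsync
in a child process and only polls that child, so the caller does not block
when the child gets stuck in the kernel. It assumes the node can still fork
processes at all, which may not hold in a lockup like the one above, so
something outside the node would still have to act on a missing result.

```python
import subprocess
import sys
import time

# Child script (hypothetical): create a small file in the given directory,
# force it to disk with fsync, then remove it again.
PROBE = (
    "import os, sys\n"
    "path = os.path.join(sys.argv[1], '.io_probe.%d' % os.getpid())\n"
    "fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)\n"
    "os.write(fd, b'x')\n"
    "os.fsync(fd)\n"
    "os.close(fd)\n"
    "os.unlink(path)\n"
)


def io_probe(directory, timeout=10.0, poll_interval=0.1):
    """Return True if a small write+fsync in `directory` finishes in time."""
    child = subprocess.Popen(
        [sys.executable, "-c", PROBE, directory],
        stderr=subprocess.DEVNULL,
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = child.poll()  # non-blocking check of the child's state
        if rc is not None:
            return rc == 0
        time.sleep(poll_interval)
    # Timed out. Send SIGKILL but do NOT wait for the child: a process
    # stuck in uninterruptible sleep will not exit, and waiting for it
    # would hang the checker as well.
    child.kill()
    return False


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "/tmp"
    sys.exit(0 if io_probe(target) else 1)
```

The child is polled by hand instead of using subprocess.run(timeout=...)
because run() waits for the child after killing it, and that wait would
itself block if the child never leaves the kernel.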


Thanks,
Bernd

-- 
Bernd Schubert
DataDirect Networks



