[Pacemaker] server lockup failures

Andrew Beekhof andrew at beekhof.net
Wed Oct 28 15:15:08 EDT 2009


On Wed, Oct 28, 2009 at 2:44 PM, Bernd Schubert
<bs_lists at aakef.fastmail.fm> wrote:
> On Wednesday 28 October 2009, Andrew Beekhof wrote:
>> On Wed, Oct 28, 2009 at 1:05 PM, Bernd Schubert
>> <bs_lists at aakef.fastmail.fm> wrote:
>> > Hello,
>> >
>> > I think there is a severe server failure that Pacemaker doesn't detect.
>> > Overnight a Lustre server failed in shrink_icache_memory(), probably
>> > while holding dcache_lock. This is a global filesystem lock, and when
>> > a filesystem fails while it is held, any IO on this system just
>> > hangs.
>>
>> And the FS in question was / so Pacemaker basically hung?
>
> I couldn't log in any more, but my guess is 'yes, it hung'. But no, it was not
> the root (/) FS. If any FS crashes while it holds dcache_lock, every other
> filesystem will hang as well.

ooohhhhh

> There is nothing we can do about that except for
> rewriting the Linux VFS ;) My question is just what we can do to get Pacemaker
> to stonith that node.

Hmmm.  Was this an openais or heartbeat based cluster?
If all the processes hung I'd have expected it to drop out of the
membership list and get shot by the new DC...
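
To make that expectation concrete, here is a rough sketch of the idea as a
toy model. It is not Pacemaker, openais or heartbeat code: the names
(Membership, heard_from, expire, TOKEN_TIMEOUT) and the 5 second timeout are
all made up for illustration. A node that stops being heard from is dropped
from the membership once the timeout passes, and whatever was dropped is what
the surviving DC would then try to shoot.

```python
# Toy model only: NOT Pacemaker, openais or heartbeat code.
# Every name and the timeout value below are hypothetical.

TOKEN_TIMEOUT = 5.0  # assumed: seconds of silence before a node is declared lost


class Membership:
    def __init__(self, nodes, now=0.0):
        # Remember when each node was last heard from.
        self.last_seen = {name: now for name in nodes}

    def heard_from(self, node, now):
        # A fully hung node never gets here, because its processes cannot run.
        if node in self.last_seen:
            self.last_seen[node] = now

    def expire(self, now):
        # Drop nodes that have been silent for longer than the timeout and
        # return them; in this model they are the fencing (STONITH) targets.
        lost = sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > TOKEN_TIMEOUT
        )
        for name in lost:
            del self.last_seen[name]
        return lost


def simulate():
    # node-b "hangs" after t=3 and stops reporting; node-a keeps going.
    membership = Membership(["node-a", "node-b"])
    fenced = []
    for now in range(1, 11):
        membership.heard_from("node-a", float(now))
        if now <= 3:
            membership.heard_from("node-b", float(now))
        fenced.extend(membership.expire(float(now)))
    return fenced, sorted(membership.last_seen)


if __name__ == "__main__":
    print(simulate())
```

The flip side is visible in the model too: a node only becomes a fencing
target once it goes silent. If the membership layer on the hung node somehow
keeps answering, nothing in this scheme would notice the failure.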



