[Pacemaker] WARN: ..... unmanaged failed resources cannot prevent clone shutdown

Wed Jul 6 12:26:57 UTC 2011

On 2011-07-05 00:24, Andrew Beekhof wrote:
> On Fri, Jul 1, 2011 at 9:23 PM, Andreas Kurz <andreas.kurz at linbit.com> wrote:
>> Hello,
>>
>> In a cluster without stonith enabled (yes I know ....) the monitor
>> failure of one resource followed by the stop failure of a dependent
>> resource led to a cascade of errors, especially because the cluster did
>> not stop the shutdown sequence on stop (timeout) failures:
>>
>> WARN: should_dump_input: Ignoring requirement that
>> resource_fs_home_stop_0 comeplete before ms_drbd_home_demote_0:
>> unmanaged failed resources cannot prevent clone shutdown
>>
>> ... and that is really ugly in a DRBD Environment, because demote/stop
>> will not work when the DRBD device is in use -- so in this case this
>> order requirement on stop must not be ignored.
> 
> Did you ask the cluster to shut down before or after the first resource failed?

Neither before nor after ...

* IP resource had a monitor failure
* restart triggered a restart of all dependent resources
* one resource had a stop failure
* cluster decided that this failed resource must move away
* cluster decided to move IP to second node and therefore all dependent
resources have to follow
* cascading stop begins
* two file systems were unable to umount --> stop failure/unmanaged,
ignored

... and now the really ugly things happened as demote and stop on DRBD
ms resources were triggered although the file systems were still online.

Furthermore the cluster additionally tried to promote DRBD on the second
node which is also impossible if the other side is not demoted/stopped.

Of course there are clones/ms resources that can stop independently of
their dependent resources but DRBD is one that can't.

So I think there should be a way to tell a clone/ms resource to _not_
ignore the order requirements on stop failures of dependent resources.

> 
>> The result was a lot of unmanaged resources and the cluster even tried
>> to promote the MS resource on the other node although the second
>> instance was neither demoted nor stopped.
> 
> We seem to lose either way.
> If we have the cluster block, people complain shutdown takes too long.

This seems to be sensible for a cluster shutdown, but I don't think this
is good behavior on resource migration. The cluster was really in a heavy
mess after this stop cascade.

I would expect the cluster to block on the first stop error and wait for
manual intervention if no fencing is configured.
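
The difference between the two behaviors discussed here can be illustrated
with a toy model. The following Python sketch is not Pacemaker code and does
not reflect how the policy engine is implemented; `run_stop_sequence`,
`toy_stop`, and the `block_on_failure` flag are hypothetical names used only
for illustration. The resource names are taken from the log message quoted
above, and the stop order is simplified to a linear list (dependents first).

```python
from typing import Callable, List, Tuple

# Toy state: the file system on top of DRBD is still mounted.
mounted = {"resource_fs_home"}

# Simplified stop order: the dependent resource comes first.
ORDER = ["resource_fs_home", "ms_drbd_home"]


def toy_stop(name: str) -> bool:
    """Hypothetical stop action; returns True on success."""
    if name == "resource_fs_home":
        # Assumption for this example: umount times out, so the
        # file system stays mounted and the stop fails.
        return False
    if name == "ms_drbd_home":
        # Demote/stop cannot succeed while the device is in use.
        return "resource_fs_home" not in mounted
    return True


def run_stop_sequence(
    order: List[str],
    stop: Callable[[str], bool],
    block_on_failure: bool,
) -> Tuple[List[str], List[str], List[str]]:
    """Run stops in order and return (stopped, failed, skipped).

    With block_on_failure=True the sequence halts at the first stop
    failure and every later resource is skipped, i.e. left alone
    until someone intervenes manually.
    """
    stopped: List[str] = []
    failed: List[str] = []
    skipped: List[str] = []
    blocked = False
    for name in order:
        if blocked:
            skipped.append(name)
            continue
        if stop(name):
            stopped.append(name)
        else:
            failed.append(name)
            if block_on_failure:
                blocked = True
    return stopped, failed, skipped


ignoring = run_stop_sequence(ORDER, toy_stop, block_on_failure=False)
blocking = run_stop_sequence(ORDER, toy_stop, block_on_failure=True)
```

In this model, `block_on_failure=False` attempts both stops and both fail,
which mirrors the cascade described above. With `block_on_failure=True` the
DRBD demote/stop is never attempted and only the file system is left failed.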

> 
> Basically at the point a resource fails and stonith is not configured
> - shutdown is best-effort.

I agree, shutdown of the failed resource can be best-effort ..
especially on cluster shutdown ... even then I'd like to have a choice.

I don't agree that ignoring the stop order requirements is also
best-effort on "simple" resource stop/migration ... I'd like to tweak my
resource to insist on stop order for dependent resources.

Thx & Regards,
Andreas

> 
>>
>> Is there any possibility to tune this behavior?
>>
>> Even with stonith enabled the cluster would first migrate all resources
>> that don't depend on the unmanaged (failed) resource away before
>> executing the stonith, am I right?
>>
>> thx & Regards,
>> Andreas
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>>
>>
