[Pacemaker] Very strange behavior on asymmetric cluster

Pavel Levshin pavel at levshin.spb.ru
Fri Mar 18 10:07:07 EDT 2011

On 18.03.2011 12:44, Lars Marowsky-Bree wrote:
> messages can't be lost internally. A network dropping a packet doesn't
> result in lost state, because the protocols will resend it (of course).

Imagine a network where connectivity is restored for a couple of seconds 
and then breaks again. A node comes online, monitor actions are 
dispatched to it, and then connectivity fails. That node will then be 
unclean and offline.

And fencing may fail in this scenario, too, because it usually relies on 
the network.

So every resource in the cluster will be impacted, whether it was 
installed on the failed node or not.

> You are right that we rely on the cluster stack itself to be healthy on
> a node; protecting against byzantine failure modes is extremely
> difficult.

I understand this fully. That is why we should keep the core cluster 
stack to the absolutely necessary minimum. Resource agents and their 
dependencies are a necessary evil where they are really needed.

But is my proposal really so difficult to implement? I feel the 
difficulty is more ideological. Meanwhile, it would prevent the scenario 
described above, among others.

> If you have a RA where that doesn't work, it needs fixing. We also try
> to enlarge our test coverage - feel free to get invited to a free lunch
> ;-) http://www.advogato.org/person/lmb/diary.html?start=110

Do you also test every RA in an environment where its dependencies 
cannot be met? For example, VirtualDomain depends on libvirtd being 
installed, running and operational. Would you find it useful to test 
this RA on a node which is not intended to run virtual machines, in all 
possible combinations of software and configuration?

> Which network faults do we not tolerate?

I hope I've answered this question already.

> Quorum nodes - i.e., a full pacemaker node - are a broken idea, and work
> around a deficiency in the cluster stack's quorum handling.

For now, a quorum node can prevent a STONITH deathmatch; that is good 
enough for me. One of my clusters consists of three nodes: two for the 
actual services and a third for some offline services which do not need 
redundancy. It is natural for me to set up the third node as a quorum 
node, i.e., a node which will not participate in failover but which may 
still monitor its own resources through Pacemaker. Thus, simple quorum 
is almost free in this setup.
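To illustrate, such a setup might be expressed in the crm shell roughly 
as follows (all node and resource names here are hypothetical, not taken 
from my actual configuration):

```
# Two service nodes (node1, node2) plus a quorum node (node3).
# Keep the failover resource off the quorum node entirely:
crm configure location webserver-avoid-node3 webserver -inf: node3

# Pin the non-redundant offline service to the quorum node:
crm configure location offline-svc-on-node3 offline-svc inf: node3
```

The -inf score forbids the resource on node3, while the inf score ties 
the offline service to it; node3 still counts toward quorum.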

But it could be another kind of cluster, for example one sharing a 
cluster filesystem. The so-called quorum node is only an example.

> In any case, the scenario you describe doesn't just happen. You could
> install a quorum node without any resource agents at all.

And that will not protect the cluster from failures caused by the 
network or by this node. Remember that a missing RA is treated as "not 
installed", but any

Andrew has suggested not running Pacemaker on the quorum node at all. 
That would help, but then we cannot run even STONITH from this node, 
which is one possible drawback.

>> To be impartial, I would like to know what good do you see in the
>> current design, for the case of asymmetrical cluster or quorum node.
> Asymmetric clusters just invert the configuration syntax - it turns the
> resource-to-node allocation from "default allow" to "default deny".
> Depending on your requirements, that may simplify your configuration. It
> doesn't mean anything else.

I mean the kind of setup that makes the asymmetric syntax practical: a 
setup where some resources exist only on some nodes.
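For instance, a minimal opt-in configuration might look like this in the 
crm shell (names hypothetical):

```
# Nothing is allowed to run anywhere by default:
crm configure property symmetric-cluster=false

# Explicitly allow the database only on the two service nodes:
crm configure location db-on-node1 database 100: node1
crm configure location db-on-node2 database 50: node2
```

With symmetric-cluster=false, the location constraints are the only 
thing granting the resource a place to run, which is the "default deny" 
inversion described above.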

> Even if that is set, we need to verify that the resources are, indeed,
> NOT running where they shouldn't be; remember, it is our job to ensure
> that the configured policy is enforced. So, we probe them everywhere to
> ensure they are indeed not around, and stop them if we find them.

Again, WHY do you need to verify things that cannot happen by 
construction? If some resource cannot, REALLY CANNOT, exist on a node, 
and the administrator can confirm this, why rely on the network, the 
cluster stack, resource agents, electricity in the power outlet, etc., 
to verify that 2+2 is still 4?

> The quorum node is a hack that I'm not a fan of; it adds much admin
> overhead (i.e., one 3rd node per cluster to manage, update, etc).
> Running a full member as "just" a quorum node is a complete waste and
> IMHO broken design.

You are right. I would not use a quorum node if it weren't here already.

>> What's good in checking a resource status on nodes where the
>> resource can not exist? What justifies increased resource downtime
>> caused by monitor failures, which are inevitable in real world?
> If you're saying that broken resource monitoring failures are
> inevitable in the real world, I could just as well ask you why they
> wouldn't fail on the actually active node too - and then we'd go into
> recovery as well (stopping & restarting the service).

If a resource fails on the active node, it will be acted upon; that is 
what all this high availability is about. If the resource fails on a 
standby node, I accept that as a cost of clustering, and I should fix 
the RA or the configuration if applicable. But when the resource fails 
on a node where it cannot exist, it fails absolutely for nothing. It is 
an unnecessary failure that could be avoided by simple configuration.

> The answer is: fix the resource agent.

I assume it is clear now that even a perfect RA may fail to deliver its 
message to the DC. Even a nonexistent RA may fail, time out, and cause 
resource disruption.

But I also feel it would be a waste of time trying to repair an RA that 
fails under conditions in which it was never supposed to work.

> It is, let me state that again, completely infeasible to check (at
> runtime!) against all possible _internal_ failure modes. We check some
> of them, but all is just impossible. And if a component "lies" to the
> rest of the stack about its state, we're out of luck. We do hope that
> our stack is more reliable than the software it's told to manage, yes ;-)

I've described a failure which has happened in reality. I've proposed a 
fix which can absolutely prevent this failure mode: stop the cluster 
from monitoring impossible cases. It will require some coding and some 
configuration to implement, but as far as I know it will not introduce 
any new problems. Do you have anything to say against it?

And side notes:

Testing is necessary, but it cannot be complete in finite time. Some 
bugs escape the test environment from time to time.

My cluster survived four real network failures and failed on the fifth. 
How many times should I test?

Pavel Levshin
