[Pacemaker] Very strange behavior on asymmetric cluster
pavel at levshin.spb.ru
Fri Mar 18 10:07:07 EDT 2011
18.03.2011 12:44, Lars Marowsky-Bree writes:
> messages can't be lost internally. A network dropping a packet doesn't
> result in lost state, because the protocols will resend it (of course).
Imagine a network where connectivity is restored for a couple of seconds
and then breaks again. A node comes online, monitoring actions are
dispatched to it, and then connectivity fails again. That node will then
be marked unclean offline.
And fencing may fail in this scenario, too, because it usually relies on
the same network. So every resource in the cluster will be impacted,
whether it was installed on the failed node or not.
> You are right that we rely on the cluster stack itself to be healthy on
> a node; protecting against byzantine failure modes is extremely hard.
I understand this fully. This is why we should keep the core cluster
stack to the absolutely necessary minimum. Resource agents and their
dependencies are a necessary evil where they are really needed.
But is my proposal really so difficult to implement? I feel the
difficulty is more ideological. Meanwhile, it would prevent the scenario
described above, among others.
> If you have a RA where that doesn't work, it needs fixing. We also try
> to enlarge our test coverage - feel free to get invited to a free lunch
> ;-) http://www.advogato.org/person/lmb/diary.html?start=110
Do you also test every RA in an environment where its dependencies
cannot be met? For example, VirtualDomain depends on libvirtd being
installed, running and operational. Would you find it useful to test this
RA on a node which is not intended to run virtual machines, in all
possible combinations of software and configuration?
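As a concrete illustration of that dependency, a VirtualDomain resource is typically defined along these lines (the resource name and config path here are hypothetical):

```shell
# Hypothetical VM resource: only meaningful on nodes where libvirtd
# is installed and running.
crm configure primitive vm1 ocf:heartbeat:VirtualDomain \
    params config="/etc/libvirt/qemu/vm1.xml" \
           hypervisor="qemu:///system" \
    op monitor interval="30s" timeout="60s"

# On a node without libvirtd, even the one-time probe of vm1 must
# execute the RA, which can error out or time out there.
```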
> Which network faults do we not tolerate?
I hope I've answered this question already.
> Quorum nodes - i.e., a full pacemaker node - are a broken idea, and work
> around a deficiency in the cluster stack's quorum handling.
For now, a quorum node can prevent a STONITH deathmatch, and that is good
enough for me. One of my clusters consists of three nodes: two for the
actual services and a third for some offline services which do not need
redundancy. It is natural for me to set up the third node as a quorum
node, i.e. a node which will not participate in failover. But it may
still monitor its own resources through Pacemaker. Thus, simple quorum
comes almost for free in this setup.
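A minimal sketch of such a three-node setup, assuming crm shell syntax and hypothetical node and resource names:

```shell
# Opt-in (asymmetric) cluster: nothing runs anywhere unless allowed.
crm configure property symmetric-cluster="false"

# The two service nodes may run the HA resource (node1 preferred).
crm configure location ha-on-node1 ha-service 100: node1
crm configure location ha-on-node2 ha-service 50: node2

# The third (quorum) node only hosts its own non-redundant resource.
crm configure location offline-on-node3 offline-service inf: node3
```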
But it could be another kind of cluster, for example one sharing a
cluster filesystem. The so-called quorum node is only an example.
> In any case, the scenario you describe doesn't just happen. You could
> install a quorum node without any resource agents at all.
And it will not protect the cluster from failures caused by the network
or by this node itself. Remember that a missing RA is treated as "not
installed", but any probe can still fail or time out.
Andrew has suggested not running Pacemaker on the quorum node at all.
That would help. But in that case we could not even run STONITH from
this node, which is one of the possible drawbacks.
>> To be impartial, I would like to know what good do you see in the
>> current design, for the case of asymmetrical cluster or quorum node.
> Asymmetric clusters just invert the configuration syntax - it turns the
> resource-to-node allocation from "default allow" to "default deny".
> Depending on your requirements, that may simplify your configuration. It
> doesn't mean anything else.
I mean the kind of setup which makes use of the asymmetric syntax
practical: a setup where some resources exist only on some nodes.
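The inversion Lars describes can be sketched in crm shell syntax (resource and node names are hypothetical):

```shell
# Symmetric cluster (default allow): ban a resource from one node.
crm configure location web-not-node3 web-server -inf: node3

# Asymmetric cluster (default deny, symmetric-cluster=false):
# the resource runs nowhere until a node is explicitly allowed.
crm configure location web-on-node1 web-server inf: node1
```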
> Even if that is set, we need to verify that the resources are, indeed,
> NOT running where they shouldn't be; remember, it is our job to ensure
> that the configured policy is enforced. So, we probe them everywhere to
> ensure they are indeed not around, and stop them if we find them.
Again, WHY do you need to verify things which cannot happen by design? If
some resource cannot, REALLY CANNOT, exist on a node, and the
administrator can confirm this, why rely on the network, the cluster
stack, resource agents, the electricity in the power outlet, etc., to
verify that 2+2 is still 4?
> The quorum node is a hack that I'm not a fan of; it adds much admin
> overhead (i.e., one 3rd node per cluster to manage, update, etc).
> Running a full member as "just" a quorum node is a complete waste and
> IMHO broken design.
You are right. I would not use a quorum node if it weren't already there.
>> What's good in checking a resource status on nodes where the
>> resource can not exist? What justifies increased resource downtime
>> caused by monitor failures, which are inevitable in real world?
> If you're saying that broken resource monitoring failures are
> inevitable in the real world, I could just as well ask you why they
> wouldn't fail on the actually active node too - and then we'd go into
> recovery as well (stopping & restarting the service).
If a resource fails on the active node, it will be acted upon; all this
high availability is about exactly that. If the resource fails on a
standby node, I can accept that as a cost of clustering, and I should fix
the RA or the configuration if applicable. But when the resource fails
on a node where it cannot exist, it fails for absolutely nothing. It is
an unnecessary failure which could be avoided by a simple configuration
option.
> The answer is: fix the resource agent.
I assume it is clear by now that even a perfect RA may fail to deliver
its message to the DC. Even a nonexistent RA may fail, time out and cause
unnecessary recovery.
But I also feel that it would be a waste of time to try to fix an RA
that fails under conditions where it was never supposed to work.
> It is, let me state that again, completely infeasible to check (at
> runtime!) against all possible _internal_ failure modes. We check some
> of them, but all is just impossible. And if a component "lies" to the
> rest of the stack about its state, we're out of luck. We do hope that
> our stack is more reliable than the software it's told to manage, yes ;-)
I've described a failure which has happened in reality, and I've proposed
a fix which can absolutely prevent this failure mode: prevent the cluster
from monitoring impossible cases. It will require some coding and some
configuration to implement, but it will not introduce any new problems,
AFAIK. Do you have anything to say against it?
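A rough sketch of what such a knob could look like, as a location constraint that both bans the resource and suppresses probing on a node. Note: the resource-discovery attribute shown here is an assumption for illustration; it was not part of Pacemaker at the time of this thread.

```shell
# Sketch only: the resource-discovery="never" attribute is hypothetical
# here; it tells the cluster not to probe vm1 on quorum-node at all.
cibadmin --create -o constraints --xml-text '
  <rsc_location id="no-probe-vm1" rsc="vm1" node="quorum-node"
                score="-INFINITY" resource-discovery="never"/>'
```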
And some side notes:
Testing is necessary, but it cannot be complete in finite time. Some
bugs escape from the test environment from time to time.
My cluster has survived 4 real network failures and failed on the 5th.
How many times should I test?