[Pacemaker] Very strange behavior on asymmetric cluster

Lars Marowsky-Bree lmb at novell.com
Fri Mar 18 05:44:23 EDT 2011


On 2011-03-17T23:14:45, Pavel Levshin <pavel at levshin.spb.ru> wrote:

> To be precise, the cluster will behave fine if the RA is missing,
> and the cluster tries to "monitor", and all the infrastructure works
> fine (so the return code "not installed" is not lost somewhere, as
> it had been in my case two weeks ago).

Hi Pavel,

Messages can't be lost internally. A network dropping a packet doesn't
result in lost state, because the protocols will resend it (of course).

You are right that we rely on the cluster stack itself to be healthy on
a node; protecting against Byzantine failure modes is extremely
difficult.

For resource agents, we rely on them being implemented properly, yes. We
supply "ocf-tester" to help validate that. There are many ways a
resource agent can screw us up: if it says "started" when the service is
actually down, we either won't start it or will go into recovery mode;
if "stop" doesn't work but claims it does, we'll start the resource
several times and end up with concurrency violations. And if it kills
the wrong process by accident, we're also out of luck.
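
For example, a quick smoke test of the shipped IPaddr2 agent might look
like this (the resource name and parameter value here are made up for
illustration):

    # exercise start/stop/monitor/meta-data and check the exit codes
    ocf-tester -n test-ip -o ip=192.168.100.50 \
        /usr/lib/ocf/resource.d/heartbeat/IPaddr2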

We rely on "monitor" to be correctly implemented. Just like we rely on
the policy engine to be working fine, and other parts of the stack.
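
As a rough sketch of what "correctly implemented" means here (the
daemon path and pidfile below are placeholders; the OCF_* variables
come from the usual ocf-shellfuncs boilerplate):

    # assumes the agent has sourced the OCF shell functions, e.g.
    # . ${OCF_ROOT:-/usr/lib/ocf}/resource.d/heartbeat/.ocf-shellfuncs
    mydaemon_monitor() {
        # report "not installed" as such, so probes on nodes without
        # the software don't count as resource failures
        [ -x /usr/sbin/mydaemon ] || return $OCF_ERR_INSTALLED  # rc 5
        # no pidfile: cleanly report "not running", not a generic error
        [ -f /var/run/mydaemon.pid ] || return $OCF_NOT_RUNNING # rc 7
        if kill -0 "$(cat /var/run/mydaemon.pid)" 2>/dev/null; then
            return $OCF_SUCCESS                                 # rc 0
        fi
        return $OCF_NOT_RUNNING   # stale pidfile, i.e. not running
    }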

If you have an RA where that doesn't work, it needs fixing. We also try
to enlarge our test coverage - feel free to get invited to a free lunch
;-) http://www.advogato.org/person/lmb/diary.html?start=110

> So this cluster will not tolerate, for example, some network faults.
> It's a strange feature for a fault-tolerant cluster.

Which network faults do we not tolerate?

> Also this design conflicts with the idea of a "quorum node", which
> is not supposed to run resources. A quorum node, by its existence
> alone, may cause resource failure!

Quorum nodes - i.e., full Pacemaker nodes - are a broken idea, and work
around a deficiency in the cluster stack's quorum handling.

In any case, the scenario you describe doesn't just happen. You could
install a quorum node without any resource agents at all.

But yes, if you install it with resource agents, and those resource
agents are broken, it can mess up your cluster. Just like that node
could mess up your network protocols. Again, we rely on the node's
cluster stack itself to be mostly healthy; anything else is
theoretically infeasible (if we go beyond certain timeouts).

> To be impartial, I would like to know what good you see in the
> current design, for the case of an asymmetrical cluster or a quorum
> node.

Asymmetric clusters just invert the configuration logic - they turn
resource-to-node allocation from "default allow" into "default deny".
Depending on your requirements, that may simplify your configuration. It
doesn't mean anything else.
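
With the crm shell, an opt-in setup is one property plus explicit
location constraints; a minimal sketch (resource and node names are
made up):

    # default deny: nothing runs anywhere unless explicitly allowed
    crm configure property symmetric-cluster="false"
    # opt in: allow webserver on node1 (preferred) and node2
    crm configure location loc-web-1 webserver 100: node1
    crm configure location loc-web-2 webserver 50: node2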

Even if that is set, we need to verify that the resources are, indeed,
NOT running where they shouldn't be; remember, it is our job to ensure
that the configured policy is enforced. So we probe them everywhere (a
probe is just a one-shot "monitor" operation) to ensure they are indeed
not around, and stop them if we find them.


The quorum node is a hack that I'm not a fan of; it adds considerable
admin overhead (i.e., a third node per cluster to manage, update, etc.).
Running a full member as "just" a quorum node is a complete waste and,
IMHO, broken design.

We're addressing that use case with a more powerful quorum framework,
the cluster token registry, multiple-device SBD, etc.


> What good is checking a resource's status on nodes where the
> resource cannot exist? What justifies the increased resource
> downtime caused by monitor failures, which are inevitable in the
> real world?

If you're saying that monitor failures from broken resource agents are
inevitable in the real world, I could just as well ask why they
wouldn't fail on the actually active node too - and then we'd go into
recovery as well (stopping & restarting the service).

The answer is: fix the resource agent.
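
For reference, the recurring monitor and the cluster's reaction to its
failure are configured on the resource itself; a sketch with made-up
names (for a failed monitor, on-fail defaults to "restart" anyway):

    crm configure primitive webserver ocf:heartbeat:apache \
        op monitor interval="30s" timeout="20s" on-fail="restart"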

> Is it a kind of automation? But it saves nothing, because the
> cluster administrator is forced to delete unused RAs after every
> installation, upgrade, and even some reconfigurations.

No. The agents have to be fixed once so that they, indeed, don't report
the resource as active when it isn't. We do take patches.

> You are stating that RAs must be reliable. It is a good point. But
> even the Earth is not completely round, and any program may fail,
> even if it is bug-free, due to external problems. Are we concerned
> about fault tolerance and high availability? If so, then we should
> think of erroneous or disastrous conditions.

We cannot protect against all software failures. The primary use case
for the cluster stack is to protect against hardware failures; second,
to protect against service crashes; third, to help with administrative
automation.

It is, let me state that again, completely infeasible to check (at
runtime!) against all possible _internal_ failure modes. We check some
of them, but checking all of them is simply impossible. And if a
component "lies" to the rest of the stack about its state, we're out of
luck. We do hope that our stack is more reliable than the software it's
told to manage, yes ;-)

Complete 100% fault tolerance is infeasible in the real world.

If you are that paranoid (which I find laudable! you're right, this all
_can_ happen!), the best answer is to deploy your updates first into a
staging cluster, which should be an as-exact-as-possible replica of your
production environment.

Simply because none of our testing can simulate _your_ specific
environment to 100% either; so whatever we do (and we have pretty good
testing, all in all), you may find bugs we missed. If you do, a staging
cluster is an excellent place to reproduce them, file bug reports, and
verify the fixes.


Regards,
    Lars

-- 
Architect Storage/HA, OPS Engineering, Novell, Inc.
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
