[Pacemaker] configuration variants for 2 node cluster

Kostiantyn Ponomarenko konstantin.ponomarenko at gmail.com
Tue Jun 24 08:51:49 EDT 2014


Chrissie,

I don't want to reinvent a quorum disk =)
I know about its complexity.
That's why I think the most reasonable decision for me is to wait until
Corosync 2 gets a quorum disk :)
But meanwhile I need to deal with my situation somehow.
So, a possible solution for me is to create a daemon which starts the
cluster stack based on certain conditions.

Here is how I see it (any improvements are appreciated):

The marker: a SCSI reservation on a shared SSD.
IMPORTANT: The daemon must be able to tell which node the marker belongs to.
QUESTION: What other markers could be used?
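
To make this concrete, here is a rough sketch of the marker primitives,
assuming the marker is a SCSI-3 persistent reservation managed with
sg_persist from sg3_utils. The device path and the per-node reservation
keys are made-up placeholders and the output parsing is naive - treat this
as an illustration, not a tested implementation:

    import subprocess

    DEVICE = "/dev/sdb"    # assumed shared SSD
    MY_KEY = "0x1"         # assumed reservation key of this node
    # The key embedded in the reservation is what lets the daemon
    # distinguish which node the marker belongs to.

    def read_marker():
        """Return the key of the reservation holder, or None if no marker."""
        out = subprocess.check_output(
            ["sg_persist", "--in", "--read-reservation", DEVICE]).decode()
        for line in out.splitlines():
            if "Key=" in line:
                return line.split("Key=")[1].split(",")[0].strip()
        return None

    def set_marker():
        """Register our key and try to take the reservation; True on success."""
        subprocess.check_call(["sg_persist", "--out", "--register",
                               "--param-sark=" + MY_KEY, DEVICE])
        rc = subprocess.call(["sg_persist", "--out", "--reserve",
                              "--param-rk=" + MY_KEY, "--prout-type=1", DEVICE])
        return rc == 0

    def remove_marker():
        """Release the reservation (only possible if we hold it)."""
        subprocess.check_call(["sg_persist", "--out", "--release",
                               "--param-rk=" + MY_KEY, "--prout-type=1", DEVICE])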

--------------
Main workflow:
--------------
1. Node starts.
2. Daemon starts.
    2.1. Check the marker. Is the marker present?
        NO:
            2.1.1. Set the marker. Successful?
                NO: Do nothing (go to 2.1 and retry a few times).
                YES: Start the cluster stack.
        YES:
            2.1.2. Ping the other node. Successful?
                NO: Do nothing: the other node is probably (99%) up, just
unreachable.
                YES:
                    Remove the marker.
                    Start the cluster stack.[*]
                    P.S.: If the cluster cannot establish a connection with
the other node, the fencing agent on this node is triggered and will fence
the other node (this can turn into a fence loop, but we can minimize the
chance of it[1]).
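
A minimal sketch of the daemon's start-up decision, reusing read_marker(),
set_marker(), remove_marker() and MY_KEY from the sketch above. The peer
address, the retry count and the use of systemd units are all assumptions,
and the branch for "the marker is our own" is my reading of the Benefits
section below:

    import subprocess
    import time

    PEER_ADDR = "192.168.1.2"    # assumed address of the other node
    RETRIES = 3

    def peer_pingable():
        # One ICMP echo with a 2 second timeout.
        return subprocess.call(["ping", "-c", "1", "-W", "2", PEER_ADDR]) == 0

    def start_cluster_stack():
        # Assuming a systemd-based system; adjust for your init system.
        subprocess.check_call(["systemctl", "start", "corosync"])
        subprocess.check_call(["systemctl", "start", "pacemaker"])

    def daemon_startup():
        for _ in range(RETRIES):
            holder = read_marker()
            if holder is None:                  # step 2.1: no marker
                if set_marker():                # step 2.1.1
                    start_cluster_stack()
                    return
            elif holder == MY_KEY:
                # The marker is ours, e.g. we won a fence race and were
                # later rebooted: start the stack without waiting.
                start_cluster_stack()
                return
            elif peer_pingable():               # step 2.1.2
                remove_marker()
                start_cluster_stack()           # [*]
                return
            else:
                # The peer holds the marker and is probably up but
                # unreachable: do nothing.
                return
            time.sleep(5)                       # retry 2.1 a few times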

----------------------
Split brain situation:
----------------------
1. The fencing agent tries to set the marker. Successful?
    NO: Do nothing: this node is going to be fenced. Meanwhile this node can
be put in standby mode while waiting for fencing.
    YES: STONITH (reboot) the other node. The marker is kept.
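
The fencing-agent side could then look something like the sketch below,
again building on set_marker() from above. stonith_admin and crm_standby
are real Pacemaker tools, but the exact invocations and the peer's node
name are assumptions:

    import subprocess

    PEER_NODE_NAME = "node-b"    # assumed cluster name of the other node

    def handle_split_brain():
        """Only the node that wins the marker race is allowed to fence."""
        if not set_marker():
            # We lost the race: the peer holds the marker and will fence us.
            # Optionally go standby while we wait to be rebooted.
            subprocess.call(["crm_standby", "-v", "on"])
            return
        # We won: reboot the peer and keep the marker.
        subprocess.check_call(["stonith_admin", "--reboot", PEER_NODE_NAME])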

---------
Benefits:
---------
Even after a reboot, one of the nodes still starts the cluster stack: the
one that the marker belongs to.

------------------
Possible problems:
------------------
If the node that the marker belongs to is not working, we need to force the
cluster stack to run on the other node.
That requires human intervention.
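
One hedged possibility for that manual step: an operator preempts the dead
node's reservation and then starts the stack. The sketch below reuses
DEVICE, MY_KEY and start_cluster_stack() from above; PEER_KEY is another
made-up placeholder:

    import subprocess

    PEER_KEY = "0x2"    # assumed reservation key of the dead marker owner

    def force_takeover():
        """Operator-invoked: steal the marker from a dead node, then start."""
        subprocess.check_call(["sg_persist", "--out", "--register",
                               "--param-sark=" + MY_KEY, DEVICE])
        subprocess.check_call(["sg_persist", "--out", "--preempt",
                               "--param-rk=" + MY_KEY,
                               "--param-sark=" + PEER_KEY,
                               "--prout-type=1", DEVICE])
        start_cluster_stack()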


=====================
* In case the ping is successful but the cluster doesn't see the other node
(is that even possible?), we can do the following:
    a. The daemon starts Corosync.
    b. It gets the list of nodes and makes sure the other node is present
there. This guarantees that the nodes see each other in the cluster.
    c. It starts Pacemaker.
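
A sketch of that ordering, assuming corosync-quorumtool -l lists the member
nodes; the node-name matching and the timeout are assumptions:

    import subprocess
    import time

    def start_stack_verified(peer_name="node-b", attempts=30):
        # a. Start Corosync only.
        subprocess.check_call(["systemctl", "start", "corosync"])
        # b. Wait until the peer shows up in the membership list.
        for _ in range(attempts):
            out = subprocess.check_output(
                ["corosync-quorumtool", "-l"]).decode()
            if peer_name in out:
                # c. Both nodes see each other: start Pacemaker.
                subprocess.check_call(["systemctl", "start", "pacemaker"])
                return True
            time.sleep(2)
        return False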



Thank you,
Kostya


On Tue, Jun 24, 2014 at 11:44 AM, Christine Caulfield <ccaulfie at redhat.com>
wrote:

> On 24/06/14 09:36, Kostiantyn Ponomarenko wrote:
>
>> Hi Chrissie,
>>
>> But wait_for_all doesn't help when there is no connection between the
>> nodes, because if I need to reboot the remaining working node, I won't
>> get a working cluster after that - both nodes will be waiting for a
>> connection to each other.
>> That's why I am looking for a solution which could give me one working
>> node in this situation (after a reboot).
>> I've been thinking about some kind of marker which could help a node
>> determine the state of the other node.
>> Like an external disk and a SCSI reservation command. Maybe you could
>> suggest another kind of marker?
>> I am not sure whether we can use the presence of a file on an external
>> SSD as the marker. Something like: if the file is there, the other node
>> is alive; if not, the node is dead.
>>
>>
> More seriously, that solution is harder than it might seem - which is one
> reason qdiskd was as complex as it became, and why votequorum is as
> conservative as it is when it comes to declaring a workable cluster. If
> someone is there to manually reboot nodes then it might be as well for a
> human decision to be made about which one is capable of running services.
>
> Chrissie
>
>  Digimer,
>>
>> Thanks for the links and information.
>> Anyway, if I go this way, I will write my own daemon to determine the
>> state of the other node.
>> Also, the information about the fence loop is new to me, thanks =)
>>
>> Thank you,
>> Kostya
>>
>>
>> On Tue, Jun 24, 2014 at 10:55 AM, Christine Caulfield
>> <ccaulfie at redhat.com> wrote:
>>
>>     On 23/06/14 15:49, Digimer wrote:
>>
>>         Hi Kostya,
>>
>>             I'm having a little trouble understanding your question,
>> sorry.
>>
>>             On boot, the node will not start anything, so after booting
>>         it, you
>>         log in, check that it can talk to the peer node (a simple ping is
>>         generally enough), then start the cluster. It will join the peer's
>>         existing cluster (even if it's a cluster on just itself).
>>
>>             If you booted both nodes, say after a power outage, you will
>>         check
>>         the connection (again, a simple ping is fine) and then start the
>>         cluster
>>         on both nodes at the same time.
>>
>>
>>
>>     wait_for_all helps with most of these situations. If a node goes
>>     down then it won't start services until it's seen the non-failed
>>     node because wait_for_all prevents a newly rebooted node from doing
>>     anything on its own. This also takes care of the case where both
>>     nodes are rebooted together of course, because that's the same as a
>>     new start.
>>
>>     Chrissie
>>
>>
>>             If one of the nodes needs to be shut down, say for repairs or
>>         upgrades, you migrate the services off of it and over to the
>>         peer node,
>>         then you stop the cluster (which tells the peer that the node is
>>         leaving
>>         the cluster). After that, the remaining node operates by itself.
>>         When
>>         you turn it back on, you rejoin the cluster and migrate the
>>         services back.
>>
>>             I think, maybe, you are looking at this as more complicated
>>         than it
>>         needs to be. Pacemaker and corosync will handle most of this for
>>         you, once
>>         set up properly. What operating system do you plan to use, and what
>>         cluster stack? I suspect it will be corosync + pacemaker, which
>>         should
>>         work fine.
>>
>>         digimer
>>
>>         On 23/06/14 10:36 AM, Kostiantyn Ponomarenko wrote:
>>
>>             Hi Digimer,
>>
>>             Suppose I disabled the cluster on start-up, but what about the
>>             remaining
>>             node, if I need to reboot it?
>>             So, even in the case of a lost connection between these two
>>             nodes I need to
>>             have one node working and providing resources.
>>             How did you solve this situation?
>>             Should it be a separate daemon which somehow checks the
>>             connection between
>>             the two nodes and decides whether to run corosync and pacemaker
>>             or to keep them
>>             down?
>>
>>             Thank you,
>>             Kostya
>>
>>
>>             On Mon, Jun 23, 2014 at 4:34 PM, Digimer <lists at alteeve.ca> wrote:
>>
>>                  On 23/06/14 09:11 AM, Kostiantyn Ponomarenko wrote:
>>
>>                      Hi guys,
>>                      I want to gather all possible configuration variants
>>                      for a 2-node cluster, because it has a lot of pitfalls
>>                      and there is not a lot of information about it on the
>>                      internet. And I also have some questions about
>>                      configurations and their specific problems.
>>                      VARIANT 1:
>>                      -----------------
>>                      We can use the "two_node" and "wait_for_all" options
>>                      from Corosync's votequorum, and set up fencing agents
>>                      with a delay on one of them.
>>                      Here is a workflow (diagram) of this configuration:
>>                      1. Node starts.
>>                      2. Cluster (Corosync and Pacemaker) starts at boot time.
>>                      3. Wait for all nodes. All nodes joined?
>>                            No. Go to step 3.
>>                            Yes. Go to step 4.
>>                      4. Start resources.
>>                      5. Split-brain situation (something wrong with the
>>                      connection between the nodes).
>>                      6. The fencing agent on one of the nodes reboots the
>>                      other node (there is a configured delay on one of the
>>                      fencing agents).
>>                      7. The rebooted node goes to step 1.
>>                      There are two (or more?) important things in this
>>                      configuration:
>>                      1. The rebooted node keeps waiting for all nodes to
>>                      become visible (the connection has to be restored
>>                      first).
>>                      2. Suppose the connection problem still exists and the
>>                      node which rebooted the other guy has to be rebooted
>>                      too (for some reason). After the reboot it is also
>>                      stuck at step 3 because of the connection problem.
>>                      QUESTION:
>>                      -----------------
>>                      Is it somehow possible to assign the node that won the
>>                      reboot race (rebooted the other guy) a status like
>>                      "primary", and allow it not to wait for all nodes
>>                      after a reboot? And drop this status once the other
>>                      node has joined it.
>>                      So is it possible?
>>                      Right now that's the only configuration I know of
>>                      for a 2-node cluster.
>>                      Other variants are very much appreciated =)
>>                      VARIANT 2 (not implemented, just a suggestion):
>>                      -----------------
>>                      I've been thinking about using an external SSD drive
>>                      (or another external drive). So, for example, a fencing
>>                      agent can reserve the SSD using a SCSI command and
>>                      after that reboot the other node.
>>                      The main idea is that the first node, as soon as the
>>                      cluster starts on it, reserves the SSD until the other
>>                      node joins the cluster; after that the SCSI reservation
>>                      is removed.
>>                      1. Node starts.
>>                      2. Cluster (Corosync and Pacemaker) starts at boot time.
>>                      3. Reserve the SSD. Did it manage to reserve?
>>                            No. Don't start resources (wait for all).
>>                            Yes. Go to step 4.
>>                      4. Start resources.
>>                      5. Remove the SCSI reservation when the other node
>>                      has joined.
>>                      6. Split-brain situation (something wrong with the
>>                      connection between the nodes).
>>                      7. The fencing agent tries to reserve the SSD. Did it
>>                      manage to reserve?
>>                            No. Maybe put the node in standby mode ...
>>                            Yes. Reboot the other node.
>>                      8. Optional: a single node can keep the SSD
>>                      reservation while it is alone in the cluster or until
>>                      it is shut down.
>>                      I am really looking forward to finding the best
>>                      solution (or a couple of them =)).
>>                      Hope I am not the only person who is interested in
>>                      this topic.
>>
>>
>>                      Thank you,
>>                      Kostya
>>
>>
>>                  Hi Kostya,
>>
>>                     I only build 2-node clusters, and I've not had
>>                  problems with this going back to 2009 over dozens of
>>                  clusters. The tricks I found are:
>>
>>                  * Disable quorum (of course)
>>                  * Set up good fencing, and add a delay to the node you
>>                  prefer (or pick one at random, if of equal value) to
>>                  avoid dual fences
>>                  * Disable the cluster on start-up, to prevent fence loops.
>>
>>                     That's it. With this, your 2-node cluster will be
>>             just fine.
>>
>>                     As for your question: once a node is fenced
>>                  successfully, the resource manager (Pacemaker) will take
>>                  over any services lost on the fenced node, if that is how
>>                  you configured it. A node that either gracefully leaves
>>                  or dies/is fenced should not interfere with the remaining
>>                  node.
>>
>>                     The problem is when a node vanishes and fencing
>>             fails. Then, not
>>                  knowing what the other node might be doing, the only
>>             safe option is
>>                  to block, otherwise you risk a split-brain. This is why
>>             fencing is
>>                  so important.
>>
>>                  Cheers
>>
>>                  --
>>                  Digimer
>>                  Papers and Projects: https://alteeve.ca/w/
>>                  What if the cure for cancer is trapped in the mind of a
>>             person
>>                  without access to education?
>>