[Pacemaker] Problems when quorum lost for a short period of time
Lev Sidorenko
levs at securemedia.co.nz
Thu Oct 3 19:25:30 UTC 2013
On Wed, 2013-10-02 at 10:40 +0200, Lars Marowsky-Bree wrote:
> On 2013-10-02T09:26:26, Lev Sidorenko <levs at securemedia.co.nz> wrote:
>
> > It is actually 2 nodes for main+standby and another two nodes just to
> > provide quorum.
>
> Like Andrew wrote, a third node would be enough for that purpose.
>
> You might as well run an iSCSI target on that node (instead of the full
> cluster stack) and use sbd to provide fencing and a quorum protocol with
> implied self-fencing.
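I have not used sbd yet, but if I read the docs correctly the setup would
be roughly: initialise a small shared disk once with "sbd -d <device>
create", let the sbd daemon watch that device on every node (SBD_DEVICE in
/etc/sysconfig/sbd), and then add a fencing resource, something like:

    crm configure primitive stonith-sbd stonith:external/sbd

Please correct me if the agent name or the sysconfig variable is wrong.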
>
> > I have no-quorum-policy="stop"
>
> If you want to be more tolerant of blips, you might consider changing
> this to "freeze". Then you'll be fine - the surviving nodes will attain
> quorum and fence the node if the issue persists.
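If "freeze" is better for our case, I assume it is just a matter of
changing the cluster property, e.g. with crmsh:

    crm configure property no-quorum-policy=freeze

Is there anything else that would need to change together with it?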
>
> > So, sometimes the main node loses connection to the cluster and reports
> > "quorum lost", but after 1-2 seconds the connection re-establishes and it
> > reports "quorum retained".
>
> The main problem of course is this. *Why* are you losing network
> connectivity so frequently that this is a problem? I assume you have
> multiple network interfaces? (Which certainly are cheaper to get than
> more nodes ...)
Yes, we are also investigating the network problem.
>
> You should investigate and fix the underlying problem.
>
> You can also tweak the timeouts in corosync.conf.
I found several options here:
http://linux.die.net/man/5/corosync.conf
which look like they could be used to increase the timeout before the
cluster detects a communication failure and triggers "no quorum":
- token
- merge
- fail_recv_const
- seqno_unchanged_const
- heartbeat_failures_allowed
- max_network_delay
- rrp_problem_count_timeout
- rrp_problem_count_threshold
- rrp_problem_count_mcast_threshold
Which of these is better to use in that situation?
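My first guess is the "token" timeout in the totem section, since as far
as I understand it controls how long corosync waits before declaring the
token lost. Something like the following (the values are only a guess,
not tested):

    totem {
            # wait 5 s instead of the default 1000 ms before declaring token loss
            token: 5000
            # consensus should stay larger than token
            # (the man page says the default is 1.2 * token)
            consensus: 6000
    }

Or would one of the rrp_* options be more appropriate here?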
>
>
>
> Regards,
> Lars
>