[Pacemaker] Problems when quorum lost for a short period of time

Wed Oct 2 08:40:33 UTC 2013

On 2013-10-02T09:26:26, Lev Sidorenko <levs at securemedia.co.nz> wrote:

> It is actually 2 nodes for main+standby and another two nodes just to
> provide quorum.

Like Andrew wrote, a third node would be enough for that purpose.

You might as well run an iSCSI target on that node (instead of the full
cluster stack) and use sbd to provide fencing and a quorum protocol with
implied self-fencing.

> I have no-quorum-policy="stop"

If you want to be more tolerant of blips, you might consider changing
this to "freeze". Then a partition that briefly loses quorum keeps its
already-running resources going instead of stopping them, and if the
issue persists, the surviving nodes will retain quorum and fence the
node.
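
As a rough sketch, assuming you manage the cluster with the crm shell
(which your quoted property syntax suggests), the change would look
something like this:

```
# Keep running resources in place on quorum loss instead of stopping them
crm configure property no-quorum-policy=freeze

# Check that the property is now set as intended
crm configure show | grep no-quorum-policy
```

Fencing needs to be configured and working for this to be safe, since
"freeze" relies on the quorate partition fencing the node if it stays
away.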

> So, sometimes main node loses connection to the cluster and reports
> "quorum lost" but after 1-2 seconds connection re-establishes and reports
> "quorum retained"

The main problem of course is this. *Why* are you losing network
connectivity so frequently that this is a problem? I assume you have
multiple network interfaces? (Which certainly are cheaper to get than
more nodes ...)

You should investigate and fix the underlying problem.

You can also tweak the timeouts in corosync.conf.
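
A minimal sketch of what that might look like, assuming the token
timeout is the one that matters here. The values below are made up for
illustration, not a recommendation; pick them based on how long your
blips actually last, and keep the rest of your existing totem section
as it is:

```
totem {
    # ... your existing settings stay unchanged ...

    # Time in milliseconds to wait for the token before declaring it
    # lost. Illustrative value: longer than the 1-2 second
    # interruptions you describe.
    token: 5000

    # Number of token retransmit attempts before a new configuration
    # is formed. Illustrative value.
    token_retransmits_before_loss_const: 10
}
```

The trade-off is that a longer token timeout also delays the detection
of a real node failure by the same amount, and the file has to be the
same on all nodes.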

Regards,
    Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde