[Pacemaker] [DRBD-user] DRBD active/passive on Pacemaker+CMAN cluster unexpectedly performs STONITH when promoting
Lars Ellenberg
lars.ellenberg at linbit.com
Mon Jul 7 08:04:20 UTC 2014
On Fri, Jul 04, 2014 at 06:04:12PM +0200, Giuseppe Ragusa wrote:
> > > The setup "almost" works (all seems ok with: "pcs status", "crm_mon
> > > -Arf1", "corosync-cfgtool -s", "corosync-objctl | grep member"), but
> > > every time it needs a resource promotion (to Master, i.e. becoming
> > > primary) it either fails or fences the other node (the one supposed to
> > > become Slave i.e. secondary) and only then succeeds.
> > >
> > > It happens, for example both on initial resource definition (when
> > > attempting first start) and on node entering standby (when trying to
> > > automatically move the resources by stopping then starting them).
> > >
> > > I collected a full "pcs cluster report" and I can provide a CIB dump,
> > > but I will initially paste here an excerpt from my configuration just
> > > in case it happens to be a simple configuration error that someone can
> > > spot on the fly ;> (hoping...)
> > >
> > > Keep in mind that the setup has separated redundant network
> > > connections for LAN (1 Gib/s LACP to switches), Corosync (1 Gib/s
> > > roundrobin back-to-back) and DRBD (10 Gib/s roundrobin back-to-back)
> > > and that FQDNs are correctly resolved through /etc/hosts
> >
> > Make sure your DRBD resources are "Connected UpToDate/UpToDate"
> > before you let the cluster take over control of who is master.
>
> Thanks for your important reminder.
>
> Actually they had been "Connected UpToDate/UpToDate"; I subsequently
> demoted them all manually to secondary, then down-ed them, before
> eventually stopping the (manually started) DRBD service.
>
> Only at the end did I start/configure the cluster.
>
> The problem is now resolved and it seems that my improper use of
> rhcs_fence as fence-peer was the culprit (now switched to
> crm-fence-peer.sh), but I still do not understand why rhcs_fence was
> called at all in the beginning (once called, it may have caused
> unforeseen consequences, I admit) since DRBD docs clearly state that
> communication disruption must be involved in order to call fence-peer
> into action.

You likely managed to have data divergence
between your instances of DRBD,
likely caused by a cluster split-brain.
So DRBD would refuse to connect,
and thus would not be connected when promoted,
which is when DRBD calls the fence-peer handler.

Just because you can shoot someone
does not make your data any better,
nor does it tell the victim node that its data is "bad"
(from the shooting node's point of view),
so they would just keep killing each other then.

"Don't do that."

Instead, tell the cluster to not even attempt to promote
unless the local data is known to be UpToDate *and*
the remote data is either known (DRBD is connected)
or known to be bad (Outdated or worse).

The ocf:linbit:drbd agent has an "adjust_master_score"
parameter for that. See there.
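
To make the promotion rule above concrete, here is a minimal Python sketch.
It only illustrates the condition, it is not the resource agent's code (the
agent works by adjusting master scores, not by returning a boolean). The
names DiskState and may_promote are made up for this example; the ordering
of the states follows DRBD's disk states from worst to best, so "Outdated
or worse" becomes a simple comparison.

```python
from enum import IntEnum


class DiskState(IntEnum):
    """DRBD disk states, ordered from worst to best (illustrative)."""
    DISKLESS = 0
    ATTACHING = 1
    FAILED = 2
    NEGOTIATING = 3
    INCONSISTENT = 4
    OUTDATED = 5
    DUNKNOWN = 6
    CONSISTENT = 7
    UP_TO_DATE = 8


def may_promote(local: DiskState, peer: DiskState, connected: bool) -> bool:
    """Return True only if promotion is safe under the rule above.

    Hypothetical helper: local data must be UpToDate, and the peer's
    data must either be known (connected) or known to be bad.
    """
    if local != DiskState.UP_TO_DATE:
        return False
    if connected:
        # Peer state is known because DRBD is connected.
        return True
    # Not connected: only safe if the peer is known to be Outdated or worse.
    return peer <= DiskState.OUTDATED


if __name__ == "__main__":
    # Disconnected with unknown peer data: do not even try to promote.
    print(may_promote(DiskState.UP_TO_DATE, DiskState.DUNKNOWN, False))
    # Disconnected, but the peer has been marked Outdated: safe.
    print(may_promote(DiskState.UP_TO_DATE, DiskState.OUTDATED, False))
```

The case that bit you is the first one: local data looks fine, the peer is
unknown, and a promotion attempt there ends in fencing.
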
Lars
--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com
DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed