[Pacemaker] [DRBD-user] DRBD active/passive on Pacemaker+CMAN cluster unexpectedly performs STONITH when promoting

Mon Jul 7 23:02:41 UTC 2014

> Date: Mon, 7 Jul 2014 10:04:20 +0200
> From: lars.ellenberg at linbit.com
> To: drbd-user at lists.linbit.com; pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] [DRBD-user] DRBD active/passive on Pacemaker+CMAN cluster unexpectedly performs STONITH when promoting
> 
> On Fri, Jul 04, 2014 at 06:04:12PM +0200, Giuseppe Ragusa wrote:
> > > > The setup "almost" works (all seems ok with: "pcs status", "crm_mon
> > > > -Arf1", "corosync-cfgtool -s", "corosync-objctl | grep member") , but
> > > > every time it needs a resource promotion (to Master, i.e. becoming
> > > > primary) it either fails or fences the other node (the one supposed to
> > > > become Slave i.e. secondary) and only then succeeds.
> > > >
> > > > It happens, for example both on initial resource definition (when
> > > > attempting first start) and on node entering standby (when trying to
> > > > automatically move the resources by stopping then starting them).
> > > > 
> > > > I collected a full "pcs cluster report" and I can provide a CIB dump,
> > > > but I will initially paste here an excerpt from my configuration just
> > > > in case it happens to be a simple configuration error that someone can
> > > > spot on the fly ;> (hoping...)
> > > > 
> > > > Keep in mind that the setup has separated redundant network
> > > > connections for LAN (1 Gib/s LACP to switches), Corosync (1 Gib/s
> > > > roundrobin back-to-back) and DRBD (10 Gib/s roundrobin back-to-back)
> > > > and that FQDNs are correctly resolved through /etc/hosts
> > > 
> > > Make sure your DRBD resources are "Connected UpToDate/UpToDate"
> > > before you let the cluster take over control of who is master.
> > 
> > Thanks for your important reminder.
> > 
> > Actually they had been "Connected UpToDate/UpToDate"; I subsequently demoted them all to secondary manually,
> > then brought them down before eventually stopping the (manually started) DRBD service.
> > 
> > Only at the end did I start/configure the cluster.
> > 
> > The problem is now resolved and it seems that my improper use of
> > rhcs_fence as fence-peer was the culprit (now switched to
> > crm-fence-peer.sh), but I still do not understand why rhcs_fence was
> > called at all in the beginning (once called, it may have caused
> > unforeseen consequences, I admit) since DRBD docs clearly state that
> > communication disruption must be involved in order to call fence-peer
> > into action.
> 
> You likely managed to have data divergence
> between your instances of DRBD,
> likely caused by a cluster split-brain.

I'm quite sure that no communication disruption happened (either on the two RRP Corosync links or on the separate DRBD link) up to the moment when I committed/started the first DRBD/KVM resources, but maybe something else interfered. I'm repeating myself, but I want to stress that after manually stopping the not-yet-clustered VM I manually demoted the primary DRBD resource to secondary and then ran "drbdadm down" on both sides; those seem innocent acts too.

> So DRBD would refuse to connect,
> and thus would be not connected when promoted.

The "not connected when promoted" part seems crucial, whatever the reason behind it.
As I said, I noticed an apparent "death" of cluster components ("pcs status" reported the cluster as not connected) on the victim node mere seconds before it was shot.

> Just because you can shoot someone
> does not make your data any better,
> nor does it tell the victim node that his data is "bad"
> (from the shooting node's point of view)
> so they would just keep killing each other then.

This is absolutely clear to me.
Thanks anyway for pointing it out.

> "Don't do that."

:)

> But tell the cluster to not even attempt to promote,
> unless the local data is known to be UpToDate *and*
> the remote data is either known (DRBD is connected)
> or the remote data is known to be bad (Outdated or worse).
> 
> the ocf:linbit:drbd agent has an "adjust_master_score"
> parameter for that. See there.

So maybe setting it to "0 10 1000 10000" (from the default of "5 10 1000 10000") could be enough, if I understood it right, given that I already have resource-and-stonith + crm-fence-peer.sh (and the stonith part is tested). But what would I gain from this (maybe it would spare me an unnecessary stonith?), given that no actual split brain happened and everything resolved "by itself"?
After restarting the shot node, it came up all right, and eventually (simply by changing from rhcs_fence to crm-fence-peer.sh and reloading the configuration) everything worked as I expected: new resources were brought up without stonith being involved, and moving resources worked too.
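
To check that I read the promotion rule quoted above correctly, here is a minimal Python sketch of it. Everything in it is my own illustration: the function name, its arguments and the set of "bad" disk states are hypothetical, and this is not how the resource agent or DRBD actually implements the decision.

```python
# Illustrative model only: the names below are hypothetical,
# not DRBD or Pacemaker APIs.

# Peer disk states treated as "Outdated or worse"
# (an assumed reading of that phrase).
KNOWN_BAD = {"Outdated", "Inconsistent", "Failed", "Diskless"}


def safe_to_promote(local_disk: str, connected: bool, peer_disk: str) -> bool:
    """Return True when the promotion rule quoted above is satisfied."""
    if local_disk != "UpToDate":
        # Local data is not known to be current: do not promote.
        return False
    if connected:
        # Replication link is up, so the peer's state is known.
        return True
    # Disconnected: promote only if the peer is known to hold bad data,
    # for example because a fence-peer handler marked it Outdated.
    return peer_disk in KNOWN_BAD


if __name__ == "__main__":
    print(safe_to_promote("UpToDate", True, "UpToDate"))   # connected peer
    print(safe_to_promote("UpToDate", False, "DUnknown"))  # peer state unknown
    print(safe_to_promote("UpToDate", False, "Outdated"))  # peer known bad
```

If this reading is right, the case I hit (promotion attempted while not connected and with the peer state unknown) is exactly the one the rule is meant to exclude.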

If you think that what happened could reveal some bugs/misbehaviour, I can privately send you full unedited logs/reports.

Many thanks again for all the help and explanations.

Regards,
Giuseppe Ragusa
