[Pacemaker] drbd on heartbeat links

Tue Nov 2 21:57:35 UTC 2010

On Tue, Nov 02, 2010 at 10:07:17PM +0100, Pavlos Parissis wrote:
> On 2 November 2010 16:15, Dan Frincu <dfrincu at streamwide.ro> wrote:
> > Hi,
> >
> > Pavlos Parissis wrote:
> >>
> >> Hi,
> >>
> >> I am trying to figure out how I can resolve the following scenario
> >>
> >> Facts
> >> 3 nodes
> >> 2 DRBD ms resources
> >> 2 group resources
> >> by default drbd1/group1 runs on node-01 and drbd2/group2 runs on node-02
> >> drbd1/group1  can only run on node-01 and node-03
> >> drbd2/group2  can only run on node-02 and node-03
> >> DRBD fencing_policy is resource-only [1]
> >> 2 heartbeat links and one of them used by DRBD communication
> >>
> >> Scenario
> >> 1) node-01 loses both heartbeat links
> >> 2) DRBD monitor detects first the absence of the drbd communication
> >> and does resource fencing by adding a location constraint which prevents
> >> drbd1 from running on node-03
> >> 3) pacemaker fencing kicks in and kills node-01
> >>
> >> due to the location constraint created at step 2, drbd1/group1 can't run in
> >> the cluster
> >>
> >>
> >
> > I don't understand exactly what you mean by this. Resource-only fencing
> > would create a -inf score on node1 when the node loses the drbd
> > communication channel (the only one drbd uses),
> Because node-01 is the primary at the moment of the failure,
> resource-fencing will create an -inf score for the node-03.
> 
> > however you could still have
> > heartbeat communication available via the secondary link, then you shouldn't
> As I wrote none of the heartbeat links is available.
> After I sent the mail, I realized that node-03 will not see the
> location constraint created by node-01 because there is no heartbeat
> communication!
> Thus I think my scenario has a flaw, since none of the heartbeat links
> are available on node-01.
> Resource-fencing from DRBD will be triggered but without any effect
> and node-03 or node-02 will fence node-01, and node-03 will become
> the primary for drbd1
> 
> > fence the entire node, the resource-only fencing does that for you, the only
> > thing you need to do is to add the drbd fence handlers in /etc/drbd.conf.
> >       handlers {
> >               fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> >               after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
> >       }
> >
> > Is this what you meant?
> 
> No.
> Dan thanks for your mail.
> 
> 
> Since there is a flaw in the scenario, let's define a similar scenario.
> 
> status
> node-01 primary for drbd1 and group1 runs on it
> node-02 primary for drbd2 and group2 runs on it
> node-03 secondary for drbd1 and drbd2
> 
> 2 heartbeat links, and one of them being used for DRBD communication
> 
> here is the scenario
> 1) on node-01 the heartbeat link which also carries DRBD communication is lost
> 2) node-01 does resource-fencing and places score -inf for drbd1 on node-03
> 3) on node-01 the second heartbeat link is lost
> 4) node-01 will be fenced by one of the other cluster members
> 5) drbd1 can't run on node-03 due to the location constraint created at step 2
> 
> The problem here is that the location constraint will stay active even
> after node-01 is fenced.

Which is good, and intended behaviour, as it protects you from
going online with stale data (changes between 1) and 4) would be lost).

> Any ideas?

The drbd setting "resource-and-stonith" simply tells DRBD
that you have stonith configured in your cluster.
It does not by itself trigger any stonith action.

So if you have stonith enabled, and you want to protect against being
shot while modifying data, you should say "resource-and-stonith".
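
For illustration, a minimal drbd.conf sketch of that setting. It assumes
DRBD 8.3 syntax (where "fencing" lives in the disk section) and a
hypothetical resource named r0; the handler paths are the ones quoted
above and may differ on your distribution. With this policy, a primary
that loses its peer freezes I/O while the fence-peer handler runs.

      resource r0 {
              disk {
                      # tell DRBD that stonith is configured in the cluster
                      fencing resource-and-stonith;
              }
              handlers {
                      fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
                      after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
              }
              # device, disk, address and other sections omitted
      }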

What exactly do you want to solve?

Either you want to avoid going online with stale data,
so you place that constraint, or use dopd, or some similar mechanism.

Or you don't care, so you don't use those fencing scripts.

Or you usually are in a situation where you do not want to use stale data,
but suddenly your primary data copy is catastrophically lost, and the
(slightly?) stale other copy is the best you have.

Then you remove the constraint or force drbd primary, or both.
This should not be automated, as it involves knowledge the cluster
cannot have, thus cannot base decisions on.

So again,

What is it you are trying to solve?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed