[Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss

Andrew Martin amartin at xes-inc.com
Wed Oct 24 10:03:36 EDT 2012


Hi Andreas,

----- Original Message -----
> From: "Andreas Kurz" <andreas at hastexo.com>
> To: pacemaker at oss.clusterlabs.org
> Sent: Wednesday, October 24, 2012 4:13:03 AM
> Subject: Re: [Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss
> 
> On 10/23/2012 05:04 PM, Andrew Martin wrote:
> > Hello,
> > 
> > Under the Clusters from Scratch documentation, allow-two-primaries
> > is set in the DRBD configuration for an active/passive cluster:
> > http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_write_the_drbd_config
> > 
> > "TODO: Explain the reason for the allow-two-primaries option"
> > 
> > Is the reason for allow-two-primaries in this active/passive
> > cluster (using ext4, a non-cluster filesystem) to allow for
> > failover in the type of situation I have described (where the old
> > primary/master suddenly goes offline, as with a power supply
> > failure)? Are split-brains prevented because Pacemaker ensures
> > that only one node is promoted to Primary at any time?
> 
> no "allow-two-primaries" needed in an active/passive setup, the
> fence-handler (executed on the Primary if connection to Secondary is
> lost) inserts a location-constraint into the Pacemaker configuration
> so
> the cluster does not even "think about" to promote an outdated
> Secondary
> 
> > 
> > Is it possible to recover from such a failure without
> > allow-two-primaries?
> 
> Yes. If you only disconnect DRBD, as in the test you describe below,
> and cluster communication over the redundant network is still
> possible (and Pacemaker is up and running), the Primary will insert
> that location constraint, which prevents a Secondary from becoming
> Primary because the constraint is already in place ... if Pacemaker
> is _not_ running during your disconnection test, you also receive an
> error, because then it is obviously impossible to place that
> constraint.
> 

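For reference, here are the fencing pieces you describe as they look in my setup (a minimal sketch; the resource name r0 and the master/slave resource name ms_drbd are placeholders for my actual names):

    resource r0 {
      disk {
        fencing resource-only;
      }
      handlers {
        # insert/remove the Pacemaker location constraint
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }

If I understand you correctly, the constraint the handler inserts looks roughly like this (crm shell syntax; the exact constraint id and node name depend on the setup):

    location drbd-fence-by-handler-ms_drbd ms_drbd \
            rule $role="Master" -inf: #uname ne node1

i.e. the Master role is pinned away from every node other than the one that placed the constraint, until crm-unfence-peer.sh removes it after a successful resync.
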
What about the situation where the primary, node0, is running fine on its own, but then its power supply fails (or it hits a kernel panic or some other critical hardware failure), so that it is instantly shut off? The resources should fail over to the secondary node, node1; however, node1's DRBD device will be in the following state:

Role:
Secondary/Unknown

Disk State:
UpToDate/DUnknown

Connection State:
WFConnection

DRBD will refuse to allow this node to be promoted to primary:
0: State change failed: (-7) Refusing to be Primary while peer is not outdated
Command 'drbdsetup 0 primary' terminated with exit code 11

Does Pacemaker have some mechanism by which node1 can outdate node0, the old master/primary, in order to promote the DRBD resource?
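
(For what it's worth, the only manual escape hatch I know of is to force the promotion on node1 and declare its local data good, e.g. with DRBD 8.3 syntax and my placeholder resource name r0:

    # on node1, only after confirming that node0 is really dead/fenced:
    drbdadm -- --overwrite-data-of-peer primary r0
    # with DRBD 8.4 this would be: drbdadm primary --force r0

but I would much rather have Pacemaker do the outdating automatically than rely on an operator running this by hand.)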

Thanks,

Andrew


> Regards,
> Andreas
> 
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
> > 
> > Thanks,
> > 
> > Andrew
> > 
> > ------------------------------------------------------------------------
> > *From: *"Andrew Martin" <amartin at xes-inc.com>
> > *To: *"The Pacemaker cluster resource manager"
> > <pacemaker at oss.clusterlabs.org>
> > *Sent: *Friday, October 19, 2012 10:45:04 AM
> > *Subject: *[Pacemaker] Behavior of Corosync+Pacemaker with DRBD
> > primary power loss
> > 
> > Hello,
> > 
> > I have a 3-node Pacemaker + Corosync cluster with two "real"
> > nodes, node0 and node1, running a DRBD resource (single-primary),
> > and the 3rd node in standby acting as a quorum node. If node0 is
> > running the DRBD resource, and is thus DRBD Primary, and its power
> > supply fails, will the DRBD resource be promoted to Primary on
> > node1?
> > 
> > If I simply cut the DRBD replication link, node1 reports the
> > following state:
> > Role:
> > Secondary/Unknown
> > 
> > Disk State:
> > UpToDate/DUnknown
> > 
> > Connection State:
> > WFConnection
> > 
> > 
> > I cannot manually promote the DRBD resource because the peer is
> > not outdated:
> > 0: State change failed: (-7) Refusing to be Primary while peer is
> > not outdated
> > Command 'drbdsetup 0 primary' terminated with exit code 11
> > 
> > I have configured the CIB-based crm-fence-peer.sh utility in my
> > drbd.conf:
> > fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> > but I do not believe it is applicable in this scenario.
> > 
> > If node0 goes offline like this and doesn't come back (e.g. after a
> > STONITH), does Pacemaker have a way to tell node1 that its peer is
> > outdated and to proceed with promoting the resource to primary?
> > 
> > Thanks,
> > 
> > Andrew
> > 