[Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss

Tue Oct 23 15:04:31 UTC 2012

Hello, 

In the Clusters from Scratch documentation, allow-two-primaries is set in the DRBD configuration for an active/passive cluster:
http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html-single/Clusters_from_Scratch/index.html#_write_the_drbd_config 

"TODO: Explain the reason for the allow-two-primaries option" 

Is allow-two-primaries set in this active/passive cluster (which uses ext4, a non-cluster filesystem) to allow failover in the situation I described, where the old primary/master suddenly goes offline, for example due to a power supply failure? Is split-brain prevented because Pacemaker ensures that only one node is promoted to Primary at any time?

Is it possible to recover from such a failure without allow-two-primaries? 

Thanks, 

Andrew 

----- Original Message -----

From: "Andrew Martin" <amartin at xes-inc.com> 
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
Sent: Friday, October 19, 2012 10:45:04 AM 
Subject: [Pacemaker] Behavior of Corosync+Pacemaker with DRBD primary power loss 

Hello, 

I have a 3-node Pacemaker + Corosync cluster with 2 "real" nodes, node0 and node1, running a DRBD resource (single-primary), and a 3rd node in standby acting as a quorum node. If node0 is running the DRBD resource (and is therefore DRBD primary) and its power supply fails, will the DRBD resource be promoted to primary on node1?

If I simply cut the DRBD replication link, node1 reports the following state:
Role: Secondary/Unknown
Disk State: UpToDate/DUnknown
Connection State: WFConnection

I cannot manually promote the DRBD resource because the peer is not outdated: 
0: State change failed: (-7) Refusing to be Primary while peer is not outdated 
Command 'drbdsetup 0 primary' terminated with exit code 11 

I have configured the CIB-based crm-fence-peer.sh utility in my drbd.conf 
fence-peer "/usr/lib/drbd/crm-fence-peer.sh"; 
but I do not believe it would be applicable in this scenario. 
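
For context, here is a minimal sketch of how such a handler is commonly wired into a resource definition. It assumes DRBD 8.x syntax and a hypothetical resource name, r0. The fencing policy and the unfence handler are not stated in this message and are shown only as assumptions:

```
resource r0 {                          # "r0" is a hypothetical resource name
  disk {
    fencing resource-only;             # assumed policy; not given in this message
  }
  handlers {
    fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
    after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";   # assumed companion handler
  }
}
```

The fence-peer handler is only invoked when a fencing policy other than the default, dont-care, is set, so the handler line by itself may not be sufficient.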

If node0 goes offline like this and doesn't come back (e.g. after a STONITH), does Pacemaker have a way to tell node1 that its peer is outdated and to proceed with promoting the resource to primary? 

Thanks, 

Andrew 

