[Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

Thu Oct 2 21:07:31 CEST 2014

On 02.10.2014 18:02, Digimer wrote:
> On 02/10/14 02:44 AM, Felix Zachlod wrote:
>> I am currently running 8.4.5 on top of Debian Wheezy with Pacemaker 1.1.7
>
> Please upgrade to 1.1.10+!
>

Are you referring to a specific bug or code change? I normally prefer
using the distribution packages over building all this from source
unless there is a very good reason. I have been running several
Pacemaker 1.1.7 clusters based on the Debian packages for a long time
without any issue, and this version has been very stable for me. As long
as I am not facing a specific problem with this version, I'd rather
stick with it than put together brand new code from source, which might
cause other compatibility issues later on.

I am fairly sure that I have found a hint about the problem:

adjust_master_score (string, [5 10 1000 10000]): master score adjustments
     Space separated list of four master score adjustments for different scenarios:
      - only access to 'consistent' data
      - only remote access to 'uptodate' data
      - currently Secondary, local access to 'uptodate' data, but remote is unknown

This is from the drbd resource agent's meta data.

As you can see, the RA reports a master score of 1000 if it is
Secondary and thinks it has up-to-date data while the state of the
remote side is unknown. According to the logs, it is indeed reporting
1000. I therefore set a location rule with a score of -1001 for the
Master role, and now Pacemaker finally waits to promote the nodes to
Master until a later monitor action, when it notices that the nodes are
connected and synced and report a master score of 10000. What I still
find interesting is:

a) Why do both DRBD nodes think they have up-to-date data when coming
back online? At least one of them should know that it was disconnected
while the other node was still up, and should consider that the data
might have changed in the meantime. And when I reboot a single node, it
can be almost sure that it has only "consistent" data, because the other
side was still Primary when this one shut down.

b) Why does apparently nobody else run into this problem, given that
any primary/primary cluster should behave like this?

In any case, I think I will try passing this on to the drbd mailing list too.
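
To make the score arithmetic behind the -1001 location rule explicit,
here is a minimal Python sketch. It is not how Pacemaker computes
placement internally. It assumes that the rule score is simply added to
the master score reported by the RA and that a negative sum blocks
promotion. The names effective_score and may_promote are hypothetical
helpers made up for this illustration.

```python
# Sketch only: Pacemaker's real placement logic is more involved than this.
# Assumption: the location rule score is added to the master score reported
# by the RA, and a negative sum blocks promotion.

LOCATION_RULE_SCORE = -1001  # location rule for the Master role (from this mail)


def effective_score(ra_master_score, rule_score=LOCATION_RULE_SCORE):
    """Hypothetical helper: combine the RA master score with the rule score."""
    return ra_master_score + rule_score


def may_promote(ra_master_score, rule_score=LOCATION_RULE_SCORE):
    """Hypothetical helper: allow promotion only for a non-negative sum."""
    return effective_score(ra_master_score, rule_score) >= 0


if __name__ == "__main__":
    # The two master scores mentioned in this mail.
    for label, score in [("Secondary, remote unknown", 1000),
                         ("connected and synced", 10000)]:
        print(label, effective_score(score), may_promote(score))
```

Under these assumptions a node reporting 1000 ends up at -1 and is held
back, while a node reporting 10000 ends up at 8999 and can be promoted.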

regards, Felix