[Pacemaker] Managing DRBD Dual Primary with Pacemaker always initial Split Brains

Wed Oct 1 20:46:03 CEST 2014

On 01/10/14 02:21 PM, Felix Zachlod wrote:
> Hello!
>
> I'm currently experimenting with how a good DRBD dual-primary setup can
> be achieved with Pacemaker. I know all of the "you have to have good
> fencing in place" things ... that is just what I am currently trying to
> test in my setup, among other things.

Good fencing prevents split-brains. You really do need it, and now is 
the time to get it working.

> But even without a node crashing or the link dropping I already have the
> problem that I always run into a split brain situation when a node comes
> up that was e.g. in Standby before.

Can you configure fencing, then reproduce? It will make debugging 
easier. Without fencing, things (drbd and pacemaker included) behave in 
somewhat unpredictable ways. With fencing, it should be easier to 
isolate the cause of the break.

> For example: I have both Nodes running connected both primary,
> everything is fine. I put one node into standby and DRBD is stopped on
> this node.
>
> I do some work, reboot the server and so on, and finally I try to rejoin
> the node to the cluster. Pacemaker starts all resources and finally
> DRBD drops the connection, informing me about a split brain.
>
> In the log this looks like:
>
> Oct  1 19:44:42 storage-test-d kernel: [  111.138512] block drbd10:
> disk( Diskless -> Attaching )
> Oct  1 19:44:42 storage-test-d kernel: [  111.139283] drbd testdata1:
> Method to ensure write ordering: drain
> Oct  1 19:44:42 storage-test-d kernel: [  111.139288] block drbd10: max
> BIO size = 1048576
> Oct  1 19:44:42 storage-test-d kernel: [  111.139296] block drbd10:
> drbd_bm_resize called with capacity == 838835128
> Oct  1 19:44:42 storage-test-d kernel: [  111.144488] block drbd10:
> resync bitmap: bits=104854391 words=1638350 pages=3200
> Oct  1 19:44:42 storage-test-d kernel: [  111.144494] block drbd10: size
> = 400 GB (419417564 KB)
> Oct  1 19:44:42 storage-test-d kernel: [  111.289327] block drbd10:
> recounting of set bits took additional 3 jiffies
> Oct  1 19:44:42 storage-test-d kernel: [  111.289334] block drbd10: 0 KB
> (0 bits) marked out-of-sync by on disk bit-map.
> Oct  1 19:44:42 storage-test-d kernel: [  111.289346] block drbd10:
> disk( Attaching -> UpToDate )
> Oct  1 19:44:42 storage-test-d kernel: [  111.289352] block drbd10:
> attached to UUIDs
> A41D74E79299A144:0000000000000000:86B0140AA1A527C0:86AF140AA1A527C1
> Oct  1 19:44:42 storage-test-d kernel: [  111.321564] drbd testdata2:
> conn( StandAlone -> Unconnected )
> Oct  1 19:44:42 storage-test-d kernel: [  111.321628] drbd testdata2:
> Starting receiver thread (from drbd_w_testdata [3211])
> Oct  1 19:44:42 storage-test-d kernel: [  111.321794] drbd testdata2:
> receiver (re)started
> Oct  1 19:44:42 storage-test-d kernel: [  111.321822] drbd testdata2:
> conn( Unconnected -> WFConnection )
> Oct  1 19:44:42 storage-test-d kernel: [  111.337708] drbd testdata1:
> conn( StandAlone -> Unconnected )
> Oct  1 19:44:42 storage-test-d kernel: [  111.337764] drbd testdata1:
> Starting receiver thread (from drbd_w_testdata [3215])
> Oct  1 19:44:42 storage-test-d kernel: [  111.337904] drbd testdata1:
> receiver (re)started
> Oct  1 19:44:42 storage-test-d kernel: [  111.337927] drbd testdata1:
> conn( Unconnected -> WFConnection )
> Oct  1 19:44:43 storage-test-d kernel: [  111.808897] block drbd10:
> role( Secondary -> Primary )
> Oct  1 19:44:43 storage-test-d kernel: [  111.810883] block drbd11:
> role( Secondary -> Primary )
> Oct  1 19:44:43 storage-test-d kernel: [  111.820040] drbd testdata2:
> Handshake successful: Agreed network protocol version 101
> Oct  1 19:44:43 storage-test-d kernel: [  111.820046] drbd testdata2:
> Agreed to support TRIM on protocol level
> Oct  1 19:44:43 storage-test-d kernel: [  111.823292] block drbd10: new
> current UUID
> 8369EB6F395C0D29:A41D74E79299A144:86B0140AA1A527C0:86AF140AA1A527C1
> Oct  1 19:44:43 storage-test-d kernel: [  111.836096] drbd testdata1:
> Handshake successful: Agreed network protocol version 101
> Oct  1 19:44:43 storage-test-d kernel: [  111.836108] drbd testdata1:
> Agreed to support TRIM on protocol level
> Oct  1 19:44:43 storage-test-d kernel: [  111.848917] block drbd11: new
> current UUID
> 69A056C665A38F35:C8B4320C2FE11A0C:D13C0AA6DC58CC8C:D13B0AA6DC58CC8D
> Oct  1 19:44:43 storage-test-d kernel: [  111.871100] drbd testdata2:
> conn( WFConnection -> WFReportParams )
> Oct  1 19:44:43 storage-test-d kernel: [  111.871108] drbd testdata2:
> Starting asender thread (from drbd_r_testdata [3249])
> Oct  1 19:44:43 storage-test-d kernel: [  111.909687] drbd testdata1:
> conn( WFConnection -> WFReportParams )
> Oct  1 19:44:43 storage-test-d kernel: [  111.909695] drbd testdata1:
> Starting asender thread (from drbd_r_testdata [3270])
> Oct  1 19:44:43 storage-test-d kernel: [  111.943986] drbd testdata2:
> meta connection shut down by peer.
> Oct  1 19:44:43 storage-test-d kernel: [  111.944063] drbd testdata2:
> conn( WFReportParams -> NetworkFailure )
> Oct  1 19:44:43 storage-test-d kernel: [  111.944067] drbd testdata2:
> asender terminated
> Oct  1 19:44:43 storage-test-d kernel: [  111.944070] drbd testdata2:
> Terminating drbd_a_testdata
> Oct  1 19:44:43 storage-test-d kernel: [  111.988005] drbd testdata1:
> meta connection shut down by peer.
> Oct  1 19:44:43 storage-test-d kernel: [  111.988089] drbd testdata1:
> conn( WFReportParams -> NetworkFailure )
> Oct  1 19:44:43 storage-test-d kernel: [  111.988094] drbd testdata1:
> asender terminated
> Oct  1 19:44:43 storage-test-d kernel: [  111.988098] drbd testdata1:
> Terminating drbd_a_testdata
> Oct  1 19:44:43 storage-test-d kernel: [  112.031948] drbd testdata2:
> Connection closed
> Oct  1 19:44:43 storage-test-d kernel: [  112.032116] drbd testdata2:
> conn( NetworkFailure -> Unconnected )
> Oct  1 19:44:43 storage-test-d kernel: [  112.032121] drbd testdata2:
> receiver terminated
> Oct  1 19:44:43 storage-test-d kernel: [  112.032124] drbd testdata2:
> Restarting receiver thread
> Oct  1 19:44:43 storage-test-d kernel: [  112.032127] drbd testdata2:
> receiver (re)started
> Oct  1 19:44:43 storage-test-d kernel: [  112.032136] drbd testdata2:
> conn( Unconnected -> WFConnection )
> Oct  1 19:44:43 storage-test-d kernel: [  112.096002] drbd testdata1:
> Connection closed
> Oct  1 19:44:43 storage-test-d kernel: [  112.096194] drbd testdata1:
> conn( NetworkFailure -> Unconnected )

At some point along the way, both nodes were Primary while not
connected, even if for just a moment. Your log snippet above shows the
results of this break; it does not appear to show the break itself.

I would do this:

With both nodes (drbd) connected and primary, open a new terminal and 
start 'tail -f -n 0 /var/log/messages'. Reproduce the process you 
describe above up to the reconnect where the split-brain is detected. 
Paste the complete log output from both nodes. Somewhere in there should 
be evidence of where the split-brain occurred.

> To resolve this problem I simply put the cluster into maintenance mode,
> stop DRBD on the one node which I just brought back up, reconnect on
> the other side, then start the other node's DRBD again, and it finally
> connects without a problem into Secondary/Primary state, without
> forcing any data to be dropped. Afterwards I can go back into
> Primary/Primary. In a real-world setup, at this point the fencing would
> have kicked in and with a bit of bad luck even fenced the healthy node
> (as I already saw the split-brain detected messages on either side of
> the cluster), bringing the possibly outdated side up with all of its
> ancient data.

I'd take an errant fence over a split-brain any day. That said, it is
very odd that you can recover without invalidating one of the nodes.
What version of DRBD are you using?

> As I can see from the logs, the resource is already being promoted even
> though it is still in WFConnection state. I assume this might be the
> problem here: both sides are already primary when they come to the
> point where the connection is established, and then one node drops the
> connection. I don't think that this can be the desired behaviour. How
> can Pacemaker be made to promote DRBD only if it is already in a
> connected state and with (assumed) good data? Or to say

If you stop DRBD on a node, the other should stay in WFConnection state. 
Otherwise, it won't hear the connection request when you restart the peer.

You are right though, if a node goes Primary *before* connecting, you 
have a split-brain. The start order should be attach -> connect -> 
primary. Why it's not doing that is to be determined, but it would 
certainly be a problem if you're seeing attach -> primary -> connect.
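
A rough way to check a log for that ordering: the sketch below scans raw
kernel log lines (one event per line, not wrapped as in the quote above)
and flags any promotion that happens while the connection is not yet
established. In the quoted log, for instance, both resources are still in
WFConnection when drbd10 and drbd11 go Primary. The function name, its
arguments and the short list of states treated as "connected" are
assumptions made for this example, not anything DRBD ships, and you have
to supply which device belongs to which resource yourself.

```shell
#!/bin/sh
# check_promote_order DEVICE RESOURCE   (hypothetical helper)
# Reads kernel log lines on stdin. Prints a line for every promotion of
# DEVICE that happened while RESOURCE was not connected, and returns 1
# if there was at least one. A promotion seen before any conn() line is
# reported with state "unknown".
# Example (the device/resource pairing is a guess, check your config):
#   check_promote_order drbd10 testdata1 < /var/log/messages
check_promote_order() {
    awk -v dev="$1" -v res="$2" '
        # Return Y from a "name( X -> Y )" field, or "" if it is absent.
        function target(line, name,    i, rest) {
            i = index(line, name "( ")
            if (i == 0) return ""
            rest = substr(line, i)
            sub(/^[^>]*> /, "", rest)   # drop everything through "-> "
            sub(/ .*$/, "", rest)       # keep only the new state
            return rest
        }
        BEGIN { conn = "unknown"; bad = 0 }
        # Skip lines that belong to other resources or devices.
        index($0, "drbd " res ":") == 0 && index($0, "block " dev ":") == 0 { next }
        {
            c = target($0, "conn")
            if (c != "") conn = c
            # Simplification: only these three states count as connected.
            if (target($0, "role") == "Primary" &&
                conn !~ /^(Connected|SyncSource|SyncTarget)$/) {
                print "UNSAFE: " dev " promoted while " res " was " conn
                bad = 1
            }
        }
        END { exit bad }
    '
}
```

Run it once per device/resource pair on each node's log; no output and a
zero exit status mean every promotion it saw came after the connection.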

> 1. bring up drbd into secondary
> 2. let drbd determine if data has to be resynced and so on
> 3. when drbd is finally in "Secondary/UpToDate" state, promote it, and
> afterwards start services that rely on the drbd device. If something
> goes wrong the promote should fail and the cluster could finally fence
> the outdated node. I know there might be a RARE situation where it might
> be necessary to start a Secondary/Unknown node up to Primary (e.g. the
> cluster was degraded and for some reason the remaining good node had
> restarted (or had to be restarted)), but this might be a thing that
> could be handled manually.

As soon as the node is Connected and one of the nodes is UpToDate, you 
can go Primary. If the node is not yet UpToDate and the peer disappears 
though, it will immediately drop back to Secondary.
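
As a sketch of that rule (a hypothetical helper, not part of DRBD or the
resource agent), a pre-promotion check could look at the strings that
'drbdadm cstate' and 'drbdadm dstate' print. It assumes a single-volume
resource, so each command prints one line, and it treats the two resync
states as connected, which is a simplification.

```shell
#!/bin/sh
# can_promote CSTATE DSTATE   (hypothetical helper)
# CSTATE: output of "drbdadm cstate <resource>", e.g. "Connected"
# DSTATE: output of "drbdadm dstate <resource>", e.g. "UpToDate/Inconsistent"
# Returns 0 when the node is connected and at least one side is UpToDate.
# Example:
#   can_promote "$(drbdadm cstate testdata1)" "$(drbdadm dstate testdata1)"
can_promote() {
    # The connection must be established first.
    case "$1" in
        Connected|SyncSource|SyncTarget) ;;
        *) return 1 ;;
    esac
    # Either the local disk (left) or the peer disk (right) is UpToDate.
    case "$2" in
        UpToDate/*|*/UpToDate) return 0 ;;
        *) return 1 ;;
    esac
}
```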

> This is a portion from the cluster config (should generally speaking be
> everything that is related to drbd directly):
>
> primitive drbd_testdata1 ocf:linbit:drbd \
>          params drbd_resource="testdata1" \
>          op monitor interval="29s" role="Master" \
>          op monitor interval="31s" role="Slave"
>
> ms ms_drbd_testdata1 drbd_testdata1 \
>          meta master-max="2" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
>
> location l-drbd1 ms_drbd_testdata1 \
>          rule $id="l-drbd1-rule" 0: #uname eq storage-test-d or #uname eq storage-test-c
>
>
> Thanks for any hints in advance,
> Felix

Please configure stonith in pacemaker, test it, then hook DRBD into 
pacemaker's fencing via crm-fence-peer.sh and set the fencing policy to 
'resource-and-stonith'. Reproduce and report back please.
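
For reference, a sketch of what the DRBD side of that hookup usually
looks like. It assumes DRBD 8.4 (where 'fencing' lives in the disk
section) and that the handler scripts are installed under /usr/lib/drbd;
check both against your installation. The unfence handler is not
mentioned above; it is the usual companion that removes the constraint
again once the resync has finished.

```
resource testdata1 {
    disk {
        # Freeze I/O and call the fence-peer handler when the peer is lost.
        fencing resource-and-stonith;
    }
    handlers {
        # Paths are assumptions; adjust to where your packages put them.
        fence-peer          "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }
    # ... existing net, volume and host sections stay as they are ...
}
```

The same settings would go into testdata2, or into a common section if
both resources should share them.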

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?