[ClusterLabs] Antw: Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0
Ulrich Windl
Ulrich.Windl at rz.uni-regensburg.de
Tue Sep 26 02:28:49 EDT 2017
>>> Eric Robinson <eric.robinson at psmnv.com> wrote on 25.09.2017 at 23:03 in
message
<DM5PR03MB27290264E6E09F5C8F744467FA7A0 at DM5PR03MB2729.namprd03.prod.outlook.com>
> Problem:
>
> Under high write load, DRBD exhibits data corruption. In repeated tests over
> a month-long period, file corruption occurred after 700-900 GB of data had been
> written to the DRBD volume.
>
> Testing Platform:
>
> 2 x Dell PowerEdge R610 servers
> 32GB RAM
> 6 x Samsung SSD 840 Pro 512GB (latest firmware)
> Dell H200 JBOD Controller
> SUSE Linux Enterprise Server 12 SP2 (kernel 4.4.74-92.32)
> Gigabit network, 900 Mbps throughput, < 1ms latency, 0 packet loss
>
> Initial Setup:
>
> Create 2 RAID-0 software arrays using either mdadm or LVM
> On Array 1: sda5 through sdf5, create DRBD replicated volume
>     (drbd0) with an ext4 filesystem
> On Array 2: sda6 through sdf6, create LVM logical volume
>     with an ext4 filesystem
>
> Procedure:
>
> Download and build the TrimTester SSD burn-in and TRIM
>     verification tool from Algolia (https://github.com/algolia/trimtester).
> Run TrimTester against the filesystem on drbd0, wait for
>     corruption to occur
> Run TrimTester against the non-drbd backed filesystem, wait
>     for corruption to occur
I don't know the tool, but isn't it expecting a bit much that the tool will trim the correct blocks through drbd->LVM/mdadm->device? Why not run the tool on the affected devices directly?
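For comparing the layers one by one, the kind of write-then-verify pass that such a burn-in performs can be sketched roughly as below. This is not TrimTester's code and does not exercise TRIM at all; the function names, the file naming and the use of SHA-256 are my own assumptions, chosen only for illustration. Note also that a re-read shortly after writing may be served from the page cache rather than from the device, so a real test would have to write far more data than fits in RAM or drop the caches between the two phases.

```python
import hashlib
import os


def write_files(directory, count, size):
    """Write `count` files of `size` random bytes each.

    Returns a dict mapping each path to the SHA-256 hex digest of the
    data that was written. Names and layout are hypothetical.
    """
    digests = {}
    for i in range(count):
        path = os.path.join(directory, f"burnin_{i:06d}.bin")
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data down to the block layer
        digests[path] = hashlib.sha256(data).hexdigest()
    return digests


def verify_files(digests):
    """Re-read every file and return the paths whose content changed."""
    corrupted = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            actual = hashlib.sha256(f.read()).hexdigest()
        if actual != expected:
            corrupted.append(path)
    return corrupted
```

Calling write_files() and then verify_files() on a directory of the filesystem on drbd0, and again on the filesystem that sits directly on the LVM/mdadm array, would apply the same check to each layer; an empty list from verify_files() means no mismatch was seen.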
>
> Results:
>
> In multiple tests over a period of a month, TrimTester would report file
> corruption when run against the DRBD volume after 700-900 GB of data had been
> written. The error would usually appear within an hour or two. However, when
> running it against the non-DRBD volume on the same physical drives, no
> corruption would occur. We could let the burn-in run for 15+ hours and write
> 20+ TB of data without a problem. Results were the same with DRBD 8.4 and
> 9.0. We also tried disabling the TRIM-testing part of TrimTester and using it
> as a simple burn-in tool, just to make sure that SSD TRIM was not a factor.
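A quick plausibility check of the figures above, as a sketch only: it assumes decimal units (1 GB = 10^9 bytes) and that writes to drbd0 are bounded by the 900 Mbps replication link, which would hold for synchronous replication (DRBD protocol C) but is not stated in the report. The variable names are mine.

```python
def hours_to_write(total_bytes, bytes_per_second):
    """Time in hours needed to write total_bytes at a constant rate."""
    return total_bytes / bytes_per_second / 3600.0


# Assumption: the measured 900 Mbps link caps the write rate on drbd0.
link_bytes_per_second = 900e6 / 8  # 112.5 MB/s

# 700-900 GB written before corruption was reported.
low_hours = hours_to_write(700e9, link_bytes_per_second)   # about 1.7 h
high_hours = hours_to_write(900e9, link_bytes_per_second)  # about 2.2 h

# Non-DRBD run: 20 TB in 15 hours, taken as a rough lower bound.
local_bytes_per_second = 20e12 / (15 * 3600)  # about 370 MB/s

print(f"drbd0: {low_hours:.2f} to {high_hours:.2f} hours until corruption")
print(f"non-DRBD write rate: {local_bytes_per_second / 1e6:.0f} MB/s")
```

Under these assumptions the roughly two hours until corruption fits the reported 700-900 GB, and the non-DRBD run wrote at about three times the rate the replication link allows.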
>
> Conclusion:
>
> We are aware of some controversy surrounding the Samsung SSD 8XX series
> drives; however, the issues related to that controversy were resolved and no
> longer exist as of kernel 4.2. The 840 Pro drives are confirmed to support
> RZAT. Also, the data corruption would only occur when writing through the
> DRBD layer. It never occurred when bypassing the DRBD layer and writing
> directly to the drives, so we must conclude that DRBD has a data corruption
> bug under high write load. However, we would be more than happy to be proved
> wrong.
>
> --
> Eric Robinson