[DRBD-user] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

Eric Robinson eric.robinson at psmnv.com
Thu Oct 5 21:56:42 CEST 2017



Hi Lars --

I've been travelling and just saw your response, and now I'm travelling again. I am very eager to provide answers to your questions and will do so at my first opportunity!

--Eric 

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Lars Ellenberg
Sent: Tuesday, October 3, 2017 12:43 AM
To: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] Warning: Data Corruption Issue Discovered in DRBD 8.4 and 9.0

On Mon, Sep 25, 2017 at 09:02:57PM +0000, Eric Robinson wrote:
> Problem:
> 
> Under high write load, DRBD exhibits data corruption. In repeated 
> tests over a month-long period, file corruption occurred after 700-900 
> GB of data had been written to the DRBD volume.

Interesting.
Actually, alarming.

Can anyone else reproduce these findings?
In a similar or different environment?

> Testing Platform:
> 
> 2 x Dell PowerEdge R610 servers
> 32GB RAM
> 6 x Samsung SSD 840 Pro 512GB (latest firmware)
> Dell H200 JBOD Controller
> SUSE Linux Enterprise Server 12 SP2 (kernel 4.4.74-92.32)
> Gigabit network, 900 Mbps throughput, < 1ms latency, 0 packet loss
> 
> Initial Setup:
> 
>     Create 2 RAID-0 software arrays using either mdadm or LVM
>     On Array 1: sda5 through sdf5, create DRBD replicated volume (drbd0) with an ext4 filesystem
>     On Array 2: sda6 through sdf6, create LVM logical volume with an ext4 filesystem
> 
> Procedure:
> 
>     Download and build the TrimTester SSD burn-in and TRIM verification tool from Algolia (https://github.com/algolia/trimtester).
>     Run TrimTester against the filesystem on drbd0, wait for corruption to occur
>     Run TrimTester against the non-drbd backed filesystem, wait for corruption to occur
> 
> Results:
> 
> In multiple tests over a period of a month, TrimTester would report 
> file corruption when run against the DRBD volume after 700-900 GB of 
> data had been written. The error would usually appear within an hour 
> or two. However, when running it against the non-DRBD volume on the 
> same physical drives, no corruption would occur. We could let the 
> burn-in run for 15+ hours and write 20+ TB of data without a problem.
> Results were the same with DRBD 8.4 and 9.0.

Which *exact* DRBD module versions, identified by their git commit ids?
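
One way to collect that information on each node is to read it from /proc/drbd. The following is only a minimal sketch: it assumes /proc/drbd starts with a "version:" line followed by a "GIT-hash:" line, as the 8.4 modules print it. The helper name parse_drbd_version and the sample text are hypothetical, and the sample hash is a placeholder, not a real commit id.

```python
import re


def parse_drbd_version(proc_text):
    """Extract (version, git_hash) from /proc/drbd-style text.

    Assumes lines of the form "version: X ..." and "GIT-hash: <hex> ...".
    Returns None for any field that is not found.
    """
    version = re.search(r"^version:\s*(\S+)", proc_text, re.MULTILINE)
    git_hash = re.search(r"^GIT-hash:\s*([0-9a-f]+)", proc_text, re.MULTILINE)
    return (
        version.group(1) if version else None,
        git_hash.group(1) if git_hash else None,
    )


# Illustrative sample only; not taken from the systems in this thread.
SAMPLE = (
    "version: 8.4.10-1 (api:1/proto:86-101)\n"
    "GIT-hash: 0123456789abcdef0123456789abcdef01234567 build by root@example\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/drbd") as f:
            text = f.read()
    except OSError:
        # Module not loaded or not a DRBD host: fall back to the sample.
        text = SAMPLE
    print(parse_drbd_version(text))
```

Running this on both nodes and comparing the output would show whether they load the same module build.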

> We also tried disabling
> the TRIM-testing part of TrimTester and using it as a simple burn-in 
> tool, just to make sure that SSD TRIM was not a factor.

"to make sure SSD TRIM was not a factor":
how exactly did you try to do that?
What are the ext4 mount options,
explicit or implicit?
(as reported by tune2fs and /proc/mounts)
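
For the /proc/mounts half of that question, a small sketch like the one below could pull out the active options for a mount point and flag online discard. It assumes the usual whitespace-separated /proc/mounts layout (device, mount point, filesystem type, options, ...). The function names, the sample text, and the mount point /mnt/drbd0 are hypothetical. It does not cover default mount options stored in the superblock, which is what the tune2fs output is for.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    If the mount point appears more than once, the last entry wins,
    since later mounts shadow earlier ones. Returns None if not found.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            result = fields[3].split(",")
    return result


def uses_online_discard(options):
    """True if the 'discard' mount option is present in the option list."""
    return options is not None and "discard" in options


# Illustrative sample only; device names and mount points are made up.
SAMPLE_MOUNTS = (
    "/dev/drbd0 /mnt/drbd0 ext4 rw,relatime,discard,data=ordered 0 0\n"
    "/dev/mapper/vg-lv /mnt/plain ext4 rw,relatime,data=ordered 0 0\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/mounts") as f:
            text = f.read()
    except OSError:
        text = SAMPLE_MOUNTS
    opts = mount_options(text, "/mnt/drbd0")
    print(opts, uses_online_discard(opts))
```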

> Conclusion:
> 
> We are aware of some controversy surrounding the Samsung SSD 8XX 
> series drives; however, the issues related to that controversy were 
> resolved and no longer exist as of kernel 4.2. The 840 Pro drives are 
> confirmed to support RZAT. Also, the data corruption would only occur 
> when writing through the DRBD layer. It never occurred when bypassing 
> the DRBD layer and writing directly to the drives, so we must conclude 
> that DRBD has a data corruption bug under high write load.

Or that DRBD changes the timing / IO pattern seen by the backend sufficiently to expose a bug elsewhere.

> However, we would be more than happy to be proved wrong.

To gather a few more data points,
does the behavior on DRBD change if you
  disk { disable-write-same; } # introduced only with drbd 8.4.10
or if you set
  disk { al-updates no; } # affects timing, among other things

Can you reproduce with other backend devices?

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD(r) and LINBIT(r) are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed
_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user


