[DRBD-user] Problems with oos Sectors after verify

Wed Mar 17 08:08:20 CET 2010

Hi,

I have a problem running DRBD 8.3.7-1 on Debian Lenny (2.6.26-AMD64-Xen).
I have six DRBD devices with a total of 3 TB. Both nodes are Supermicro AMD
Opteron boxes (one 12-core, one 4-core) with a dedicated 1 GBit connection for
DRBD and Adaptec 5800 RAID controllers. One side has an NVIDIA forcedeth NIC,
the other an Intel e1000. Protocol is C. The dom0 has 2 GByte of RAM.

Basically, two symptoms can be observed, but I am not sure whether they are related:

1. Data Integrity errors
I get occasional data integrity errors (checksummed with crc32c) on both nodes 
in the cluster. 

[ 8961.266879] block drbd3: Digest integrity check FAILED.
[22846.253694] block drbd3: Digest integrity check FAILED.
[23557.272471] block drbd3: Digest integrity check FAILED.

As recommended before, I went through the standard procedures (disabling
offloading, memtest, replacing cables, replacing one of the boxes), but without
success. The errors are only reported for devices for which the respective node
is secondary.
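
To see how the failures are distributed across devices, kernel log lines like the ones above can be tallied with a few lines of Python. This is only a minimal sketch: the helper name `count_integrity_failures` and the regular expression are my own assumptions, built from the message format quoted above, and are not part of DRBD.

```python
import re
from collections import Counter

# Matches the kernel log lines quoted above; the pattern is an assumption
# derived from that message format, not an official DRBD interface.
FAILED = re.compile(r"block (drbd\d+): Digest integrity check FAILED")

def count_integrity_failures(log_text):
    """Return a Counter mapping device name to number of failed digest checks."""
    return Counter(m.group(1) for m in FAILED.finditer(log_text))

sample = """\
[ 8961.266879] block drbd3: Digest integrity check FAILED.
[22846.253694] block drbd3: Digest integrity check FAILED.
[23557.272471] block drbd3: Digest integrity check FAILED.
"""
print(count_integrity_failures(sample))  # Counter({'drbd3': 3})
```

Feeding it the full kernel log of each node would show whether the failures cluster on particular devices.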

2. oos after verify
I always get a few out-of-sync (oos) sectors after verifying any device which
has been used previously. These are not false positives; the sectors really do
differ:

2,5c2,5
< 0000010: 0000 0000 0800 0000 0000 00ff 0000 0000  ................
< 0000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
< 0000030: 0000 0000 ffff ffff ffff ffff 0000 0000  ................
< 0000040: 0000 0400 0000 0000 0000 0000 0000 0000  ................
---
> 0000010: 0000 0000 0800 0000 0000 19ff 0000 0000  ................
> 0000020: 0000 002b 0000 0000 0000 0000 0000 0000  ...+............
> 0000030: 0000 002b ffff ffff ffff ffff 0000 0000  ...+............
> 0000040: 0000 0400 0000 0000 0002 8668 0000 0000  ...........h....
8c8
< 0000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
---
> 0000070: 0000 0f03 0000 0000 0000 0001 0000 0000  ................

After disconnecting, reconnecting, and resyncing the device, they are identical
again. This happens with random sectors and on basically every verify.
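
A comparison like the hexdump diff above can also be scripted. The following is a minimal sketch, assuming the same region has been read from the backing device on each node into byte strings of equal length. The function name `differing_sectors` and the 512-byte sector size are assumptions for illustration, and the data in the example is made up.

```python
from typing import List

def differing_sectors(a: bytes, b: bytes, sector_size: int = 512) -> List[int]:
    """Return indices of sectors whose contents differ between two dumps."""
    if len(a) != len(b):
        raise ValueError("dumps must have the same length")
    return [
        offset // sector_size
        for offset in range(0, len(a), sector_size)
        if a[offset:offset + sector_size] != b[offset:offset + sector_size]
    ]

# Made-up data: two 4-sector dumps with a single byte changed in sector 2.
primary = bytes(2048)
secondary = bytearray(primary)
secondary[2 * 512 + 0x1A] = 0x19
print(differing_sectors(primary, bytes(secondary)))  # [2]
```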

Here is the relevant part of my global DRBD config:

        startup {
                wfc-timeout 60;
                degr-wfc-timeout 300;
        }

        disk {
                on-io-error detach;
        }

        net {
                cram-hmac-alg sha1;
                after-sb-0pri disconnect;
                after-sb-1pri disconnect;
                after-sb-2pri disconnect;
                data-integrity-alg crc32c;
                max-buffers 3000;
                max-epoch-size 8000;
        }

        syncer {
                rate 25M;
                verify-alg crc32c;
                csums-alg crc32c;
                al-extents 257;
        }

I also tuned the TCP settings using sysctl:

net.ipv4.tcp_rmem = 131072  131072  16777216
net.ipv4.tcp_wmem = 131072  131072  16777216
net.core.rmem_max = 10485760 
net.core.wmem_max = 10485760 
net.ipv4.tcp_mem = 96000 128000 256000
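
When reading these values, note that net.ipv4.tcp_mem is expressed in memory pages, whereas tcp_rmem, tcp_wmem and the net.core limits are in bytes. A quick conversion sketch, assuming 4 KiB pages (the usual size on x86-64; the helper name is made up):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as is usual on x86-64

def pages_to_mib(pages, page_size=PAGE_SIZE):
    """Convert a page count to MiB."""
    return pages * page_size / (1024 * 1024)

# low, pressure, high thresholds from the sysctl settings above
tcp_mem = (96000, 128000, 256000)
print([pages_to_mib(p) for p in tcp_mem])  # [375.0, 500.0, 1000.0]
```

Under that assumption the upper threshold is about 1000 MiB, roughly half of the 2 GByte available in the dom0.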

I am not sure in which direction to look next and would appreciate any
suggestions.

Thanks.

Regards,
Henning
COM+ IT Consulting