[DRBD-user] Tracking down sources of corruption examined by drbdadm verify

Szeróvay Gergely gergely.szerovay at gmail.com
Thu Apr 17 13:15:21 CEST 2008


Hello All,

I have a three-node system with 25 DRBD-mirrored partitions, about
250GB in total. The three nodes are:
- immortal: Intel 82573L Gigabit Ethernet NIC (kernel 2.6.21.6,
driver: e1000, version: 7.3.20-k2-NAPI, firmware-version: 0.5-7)
- endless: Intel 82566DM-2 Gigabit Ethernet NIC (kernel 2.6.22.18,
driver: e1000, version: 7.6.15.4, firmware-version: 1.3-0)
- infinity: Intel 82573E Gigabit Ethernet NIC (kernel 2.6.22.18,
driver: e1000, version: 7.6.15.4, firmware-version: 3.1-7)

One month ago I switched from DRBD 7.x to 8.2.5. I had used the 7.x
series without problems. The update itself also went without problems:
both sides of each mirror connected and synced cleanly.

After the update I started to verify the DRBD volumes:
- most of them usually have no out-of-sync (oos) blocks
- one gets 2-3 new oos blocks almost every day
- a few of them get a new oos block about once a week

I am trying to track down the source of the oos blocks. I read through
the drbd-user archives and found very useful hints in the "Tracking
down sources of corruption (possibly) detected by drbdadm verify"
thread.

I checked the network connections between every pair of nodes, in both
directions, with this test:

host1:~ # md5sum /tmp/file_with_1GB_random_data
host2:~ # netcat -l -p 4999 | md5sum
host1:~ # netcat -q0 192.168.x.x 4999 < /tmp/file_with_1GB_random_data

The test always gives the same md5sum on both tested nodes; the
transfer speed is about 100MB/sec when the file is cached.

I repeated this test between every pair of nodes many times and found
no md5 mismatch.

I saved the oos blocks from the underlying devices, using commands like this:

host:~ # dd iflag=direct bs=512 skip=11993992 count=8 \
    if=/dev/immortal0/65data2 | xxd -a > ./primary_4k_dump

when the syslog message was

"Apr 17 11:14:09 immortal kernel: drbd6: Out of sync: start=11993992,
size=8 (sectors)"

and the primary underlying device was /dev/immortal0/65data2.
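
The mapping from the log line to the dd command is mechanical: start= becomes skip= and size= becomes count=, both counted in 512-byte sectors, so size=8 is one 4k block. A minimal sketch of that step follows. The function name oos_to_dd is hypothetical, and it assumes the log line has exactly the "start=N, size=M" form shown above; it only prints the command and does not run it.

```shell
#!/bin/sh
# Hypothetical helper: turn a DRBD "Out of sync" syslog line into the
# matching dd command for a given backing device. Prints the command only.
oos_to_dd() {
    line=$1   # the full syslog line
    dev=$2    # underlying (backing) device, supplied by the caller

    # Pull out the numbers after "start=" and "size=" (512-byte sectors).
    start=$(printf '%s\n' "$line" | sed -n 's/.*start=\([0-9][0-9]*\).*/\1/p')
    size=$(printf '%s\n' "$line" | sed -n 's/.*size=\([0-9][0-9]*\).*/\1/p')

    if [ -z "$start" ] || [ -z "$size" ]; then
        echo "no start=/size= found in log line" >&2
        return 1
    fi

    printf 'dd iflag=direct bs=512 skip=%s count=%s if=%s\n' \
        "$start" "$size" "$dev"
}

# Example with the log line and device from this message.
oos_to_dd 'Apr 17 11:14:09 immortal kernel: drbd6: Out of sync: start=11993992, size=8 (sectors)' \
    /dev/immortal0/65data2
```

The printed command can then be piped into xxd on each node as above.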

I compared the problematic blocks from the two nodes with diff:

host:~ # diff ./primary_4k_dump ./secondary_4k_dump

I usually found a 1-2 byte difference between the blocks on the two
nodes, but one time I found that the last 1336 bytes of the block were
zeroed out (on the other node it had "random" data). Two examples:

One 4k oos block:
c2
< 0000010: 0000 0000 1500 0000 0000 01ff 0000 0000  ................
---
> 0000010: 0000 0000 1500 0000 0001 01ff 0000 0000  ................

Another 4k oos block:
22c22
< 00001f0: 0b85 0000 0000 0000 1800 0000 0000 0000  ................
---
> 00001f0: 2d79 0000 0000 0000 1800 0000 0000 0000  -y..............
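
For one- or two-byte differences like these, a byte-level comparison of the raw blocks may be easier to read than a diff of hex dumps. The sketch below is an alternative to the xxd/diff approach, not what was used above: it assumes the dd output was saved to raw files instead of being piped through xxd, and the file names and the helper count_diff_bytes are hypothetical. cmp -l prints one line per differing byte: the 1-based byte offset, then the two byte values in octal.

```shell
#!/bin/sh
# Hypothetical helper: count how many bytes differ between two raw dumps.
count_diff_bytes() {
    # cmp -l emits one line per differing byte; count those lines.
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}

# Small stand-in files instead of real 4k blocks, for illustration only.
demo=$(mktemp -d)
printf 'AAAAAAAA' > "$demo/primary.bin"
printf 'AAABAAAC' > "$demo/secondary.bin"

# List the differing bytes (cmp exits non-zero when the files differ).
cmp -l "$demo/primary.bin" "$demo/secondary.bin" || true

# Print the number of differing bytes (2 for the stand-in files).
count_diff_bytes "$demo/primary.bin" "$demo/secondary.bin"

rm -rf "$demo"
```

With real dumps, the two files would be the 4096-byte dd outputs from the primary and the secondary for the same start sector.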

Any ideas would help.

Thank you all for your time: Gergely Szerovay


