[DRBD-user] problems with xfs and drbd in disconnected mode

Mon Nov 27 13:20:00 CET 2006

Hi,

I started some benchmarks on a 4,5TB lvm which is built on top of 3 drbd
devices (three 1,5TB volumes of a 4,5TB RAID5 with Areca Controller). The
write performance seems to be good and is only limited by the GbE link.
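
As a rough check of the GbE limit, here is a minimal sketch. It assumes
tiobench reports its rate in MB/s and uses the raw Gigabit Ethernet line
rate (1000 Mbit/s = 125 MB/s) with no protocol overhead. The 97.36 figure is
the connected xfs sequential write rate from the results in this post; the
function name is only illustrative.

```python
# Raw GbE line rate in MB/s; real TCP/IP throughput is lower (assumption:
# overhead is ignored here on purpose).
GBE_RAW_MB_PER_S = 1000 / 8


def link_utilisation(rate_mb_per_s, link_mb_per_s=GBE_RAW_MB_PER_S):
    """Return the fraction of the raw link rate used by a given throughput."""
    return rate_mb_per_s / link_mb_per_s


if __name__ == "__main__":
    connected_xfs_write = 97.36  # tiobench rate, connected xfs, sequential writes
    print(f"{link_utilisation(connected_xfs_write):.1%} of raw GbE line rate")
```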

Setup:
Core 2 Duo 6600 CPU
4 GB RAM
Ubuntu Linux 6.06
2.6.15-27-amd64-server (just switched to x86-amd64 from i386)
drbd 0.7.15
directly connected GbE links

3 x 1,5TB drbd devices (internal meta data, sdc1/r0, sdd1/r1, sde1/r2)
3 x 1,5TB lvm --> 1 x 4,5TB lv with xfs as fs

I have some problems with xfs. Running tiobench with 8 threads on the
4,5TB lvm, the server hangs after some time. I'm still able to ping the
server, I get the ssh login banner and I'm able to enter user/password,
but then I'm not getting any further. The same happens on the console.
There is also a high system load (~13, well that's not _that_ high). There
is no other way than to reset the server and reboot. No messages on the
console, syslog just stops printing --Mark-- messages. I also stopped
tiobench once to avoid rebooting the system and tiobench was listed in the
process list as defunct. But the system load still grew although the cpu
was 100% idle and no disk activity happened. I had to reset the server
after a couple of minutes.
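
A load that keeps growing while the cpu is idle usually points to tasks
stuck in uninterruptible sleep (state D), which Linux counts toward the load
average. The following is only a sketch of how such tasks could be listed
from /proc; the function names are made up for illustration and it has not
been run on the affected server.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return (pid, command) for every process currently in state D."""
    if not os.path.isdir(proc_root):
        return []
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            continue  # process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

It would have to run in a loop from a session opened before the benchmark
starts, since new logins stop working once the hang begins.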

This problem only happens if drbd is in disconnected mode (higher disk
throughput?). It does not happen with ext3 - which is a bit slower - and
not with a 4,5TB lvm without drbd.

Here are some benchmark results:

tiobench --numruns 3  --threads 8 --block 4096 --size 8000

File  Blk   Num                 Avg    Maximum      Lat%     Lat%    CPU
Size  Size  Thr   Rate(CPU%)  Latency  Latency      >2s      >10s    Eff

Sequential Reads  (no drbd! xfs)
8000  4096    8  106.42 77.62%  0.844  10901.35   0.00264  0.00000   137
Sequential Writes (no drbd! xfs)
8000  4096    8  199.21 278.7%  0.280  42462.46   0.00307  0.00020    71

Sequential Reads  (Connected xfs)
8000  4096    8   98.54 103.7%  0.912  11898.48   0.00302  0.00000    95
Sequential Writes (Connected xfs)
8000  4096    8   97.36 163.2%  0.666 118173.24   0.00336  0.00102    60

Sequential Reads (Connected ext3)
8000  4096    8   92.37 98.69%  0.988  12600.68   0.00307  0.00000    94
Sequential Writes (Connected ext3)
8000  4096    8   59.97 158.5%  1.221  58146.28   0.01143  0.00020    38

Sequential Reads   (Disconnected xfs)
system hangs
Sequential Writes  (Disconnected xfs)
system hangs

Sequential Reads (Disconnected ext3)
8000  4096    8   77.53 87.43%  1.187  16140.48   0.00395  0.00000    89
Sequential Writes (Disconnected ext3)
8000  4096    8   64.43 143.7%  1.020 135364.99   0.00928  0.00323    45

One other thing I only noticed with ext3 and not with xfs:

kernel: [242696.450763] drbd2: [tiotest/7417] sock_sendmsg time expired, ko = 3
kernel: [242794.614237] drbd2: [tiotest/7419] sock_sendmsg time expired, ko = 3
kernel: [242820.882346] drbd2: [kjournald/7015] sock_sendmsg time expired, ko = 3
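
To see how often these messages occur and which processes trigger them, the
log could be summarised with something like the sketch below. It assumes
each message sits on a single line in syslog (the mail may have wrapped
them) and that the format is exactly as quoted above; the function name is
made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
# kernel: [242696.450763] drbd2: [tiotest/7417] sock_sendmsg time expired, ko = 3
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<dev>drbd\d+): "
    r"\[(?P<proc>[^/\]]+)/(?P<pid>\d+)\] sock_sendmsg time expired, "
    r"ko = (?P<ko>\d+)"
)


def count_timeouts(lines):
    """Count 'sock_sendmsg time expired' messages per (device, process name)."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("dev"), match.group("proc"))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: [242696.450763] drbd2: [tiotest/7417] sock_sendmsg time expired, ko = 3",
        "kernel: [242794.614237] drbd2: [tiotest/7419] sock_sendmsg time expired, ko = 3",
        "kernel: [242820.882346] drbd2: [kjournald/7015] sock_sendmsg time expired, ko = 3",
    ]
    for (dev, proc), n in sorted(count_timeouts(sample).items()):
        print(dev, proc, n)
```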

Any ideas about this xfs/drbd problem? I'm a bit lost, because I don't see
any kernel messages or logfile entries. I'm not even sure if it's a
kernel, drbd, lvm or xfs problem, but it only occurs in conjunction with
xfs and drbd.

Ralf