Hi everyone,

We have a fairly simple setup: two CentOS 5.5 nodes running Xen 3.4.2 compiled from source (kernel 2.6.18-xen) and DRBD 8.3.7, also compiled from source. Both nodes have two data partitions which are synced by DRBD. Each node runs a single VM from one of the partitions in standard Primary/Secondary mode. This way each node can fully utilize its CPU and memory resources and we still have storage failover capabilities. The VMs use the DRBD devices directly (no LVM or the like). The nodes are connected to each other through a gigabit Ethernet port and a crossover cable.

Over time, as the VMs' resource usage rose, they started behaving strangely. After investigating, everything points to an IO problem, as reads and writes are very slow. My tests have shown that while DRBD replication is connected and running, IO performance is very bad, not only inside the VM but also on the host node. It is as if DRBD caused the underlying IO subsystem to become very slow.

I should add that the servers use Adaptec 5405 RAID cards with BBUs and write cache enabled. As for disks, we have 4x SATA drives configured as a RAID-10.

As soon as I disconnect DRBD, IO performance is much better both inside and outside the VMs.
Xen VM config:

    disk = [ 'drbd:drbd0,sda,w' ]

# drbdsetup /dev/drbd1 show
disk {
        size                    0s _is_default; # bytes
        on-io-error             detach;
        fencing                 dont-care _is_default;
        no-disk-barrier ;
        no-disk-flushes ;
        no-md-flushes   ;
        max-bio-bvecs           0 _is_default;
}
net {
        timeout                 60 _is_default; # 1/10 seconds
        max-epoch-size          8192;
        max-buffers             8192;
        unplug-watermark        128 _is_default;
        connect-int             10 _is_default; # seconds
        ping-int                10 _is_default; # seconds
        sndbuf-size             0 _is_default; # bytes
        rcvbuf-size             0 _is_default; # bytes
        ko-count                0 _is_default;
        cram-hmac-alg           "sha1";
        shared-secret           "secret";
        after-sb-0pri           discard-zero-changes;
        after-sb-1pri           discard-secondary;
        after-sb-2pri           disconnect _is_default;
        rr-conflict             disconnect _is_default;
        ping-timeout            5 _is_default; # 1/10 seconds
}
syncer {
        rate                    33792k; # bytes/second
        after                   -1 _is_default;
        al-extents              1801;
        verify-alg              "crc32c";
}
protocol C;
_this_host {
        device                  minor 1;
        disk                    "/dev/sda7";
        meta-disk               internal;
        address                 ipv4 10.10.0.1:7789;
}
_remote_host {
        address                 ipv4 10.10.0.2:7789;
}

I have also noticed that the 'lo' and 'ua' values are usually fairly high in /proc/drbd. Also, the activity log updates are increasing fairly rapidly, at about 10 updates a second.

# On the primary node
 1: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
    ns:1912226456 nr:0 dw:1406464980 dr:503249931 al:153036012 bm:3232164 lo:0 pe:36 ua:0 ap:35 ep:1 wo:d oos:0

# Secondary node
 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
    ns:0 nr:10502904 dw:1911380520 dr:0 al:0 bm:45648 lo:38 pe:0 ua:38 ap:0 ep:1 wo:d oos:0

Any ideas?

Thanks

--
Jean-Francois Chevrette
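For anyone wanting to put a number on the activity-log update rate mentioned above, the counters in /proc/drbd can be sampled twice and differenced. The following is only a minimal sketch and not part of the original setup: the helper names `parse_counters` and `rate_per_second` are hypothetical, and it assumes the 8.3-style `key:number` layout shown in the /proc/drbd output above.

```python
import re

# Matches numeric fields such as "al:153036012"; non-numeric fields
# like "wo:d" or "cs:Connected" are deliberately skipped.
COUNTER_RE = re.compile(r"\b([a-z]+):(\d+)\b")


def parse_counters(status_text):
    """Return the numeric key:value fields of one device entry as a dict.

    Pass the text for a single DRBD device; with several devices in the
    same string, later values would overwrite earlier ones.
    """
    return {key: int(value) for key, value in COUNTER_RE.findall(status_text)}


def rate_per_second(before, after, key, interval_seconds):
    """Average increase per second of one counter between two samples."""
    return (after[key] - before[key]) / interval_seconds


if __name__ == "__main__":
    # Counter line copied from the primary node's output in this post.
    sample = (
        "ns:1912226456 nr:0 dw:1406464980 dr:503249931 al:153036012 "
        "bm:3232164 lo:0 pe:36 ua:0 ap:35 ep:1 wo:d oos:0"
    )
    counters = parse_counters(sample)
    print(counters["al"], counters["pe"], counters["ap"])
```

In practice one would read /proc/drbd, wait a known interval, read it again, and pass both parsed samples to `rate_per_second` with the key `"al"`.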