Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello all,

My current drbd-0.6.9ish production fileserver systems have > 2 TB of ext3 filesystems on raw drbd devices, but I am upgrading them and would prefer to move to xfs on lvm2.

The new systems are dual-Xeon boxes running drbd-0.7.7 on Gentoo with kernel 2.6.10, with a bonded gigabit cross connect. Each host is hooked up to a ~3.3 TB external SCSI-to-SATA disk array.

The devices are layered xfs/lvm/drbd/sd[c1,c2,d1], so the volume group consists of three ~1 TB /dev/drbd devices as physical volumes.

In my early tests (using a single 1.3 TB device) I was able to move the drbd primary from one server to the other, run an lvscan, and the second machine would pick up the volume group just fine. Now, however, I am getting an error on the largest (1.3 TB) of the three /dev/drbd physical volumes, even though drbd appears to function normally. Here mason3 is the original primary and mason4 the standby:

mason3 etc # vgchange -an vg
mason3 etc # ha.d/resource.d/drbddisk stop

mason4 etc # ha.d/resource.d/drbddisk start
mason4 etc # vgscan
  Reading all physical volumes.  This may take a while...
  Couldn't find device with uuid 'PU25qj-ayHw-vneG-zQmr-WWRV-tuRW-HF51b1'.
  Couldn't find all physical volumes for volume group vg.
  Volume group "vg" not found
mason4 etc # cat /proc/drbd
version: 0.7.7 (api:77/proto:74)
SVN Revision: 1680 build by root@mason4, 2005-01-10 00:47:37
 0: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:235033 dw:235033 dr:208 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
 1: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:1264066009 dw:1264066009 dr:0 al:0 bm:154278 lo:0 pe:0 ua:0 ap:0
 2: cs:Connected st:Primary/Secondary ld:Consistent
    ns:0 nr:132104 dw:132104 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
mason4 etc #

The device in question is /dev/drbd1. I tried doing an invalidate on mason4 and resyncing, which didn't seem to help. However, if I move the primary back over to mason3, remove this one physical volume from the VG, and then fail back to mason4, mason4 will take over the VG just fine. It just doesn't like the one physical device, /dev/drbd1.

So my questions are:

1. Is the device actually replicating properly, or is /proc/drbd mistaken?

2. Is this amount of data exceeding some limit on my 32-bit x86 systems, causing it to fail silently rather than report an error? I had this problem under drbd 0.6.x and solved it by adjusting the 2.4 kernel for a larger vmalloc size of 512 MB. I'm not sure whether this is still an issue under 2.6/0.7; in any case drbd is not returning errors.

3. Is the best way around this problem to lay drbd on top of LVM, as I saw Phillip did in his performance tests and have seen from others? I see this solution as more awkward, because you have to administer each server's VGs individually, but it could still work now that there is "drbdadm resize".

Thanks very much for any help!

--
Trey Palmer
Systems Development Engineer
trey at isye.gatech.edu
Georgia Tech Industrial and Systems Engineering
404-385-3080
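
For readers less familiar with the layering described above, a minimal sketch of how the xfs-on-lvm2-on-drbd stack is typically assembled on the primary follows; the LV name, size, and mount point are illustrative assumptions, not details from the post:

  # the three replicated devices become LVM physical volumes
  pvcreate /dev/drbd0 /dev/drbd1 /dev/drbd2
  # group them into the volume group "vg"
  vgcreate vg /dev/drbd0 /dev/drbd1 /dev/drbd2
  # carve out a logical volume (name and size are placeholders)
  lvcreate -L 2800G -n data vg
  # put xfs on it and mount it on the primary only
  mkfs.xfs /dev/vg/data
  mount /dev/vg/data /export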
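
Similarly, a minimal sketch of the alternative raised in question 3, with drbd layered on top of LVM rather than under it; the local VG and LV names are assumptions, and the exact grow procedure depends on the drbd and xfs tool versions:

  # on each host independently: the raw partitions become local physical volumes
  pvcreate /dev/sdc1 /dev/sdc2 /dev/sdd1
  vgcreate vg_local /dev/sdc1 /dev/sdc2 /dev/sdd1
  # one local LV per drbd resource; size is a placeholder
  lvcreate -L 2800G -n r0 vg_local
  # drbd.conf on each node then points the resource's "disk" at /dev/vg_local/r0;
  # growing later would mean lvextend on both nodes, then "drbdadm resize" for
  # the resource, then xfs_growfs on the mounted filesystem on the primary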