Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi guys,

O.K., I have some reply from the customer. Lars, I'm still waiting to hear back on your question concerning write rates, but the response I'm pasting in here might give a few more clues as to what is going on. I should explain that there are also 4 separate partitions being replicated, hence his reference to /dev/drbd2, which is one of them (this should be clear from the config file below). So, here's what we have so far (kind of long). His response:

------------------------------------------------------------------------

If I understood the e-mail response correctly, it said to set al-extents to 257. I did this and restarted DRBD. Unfortunately it didn't change the situation. Let me explain in detail what is happening, the way I see it:

1 - The al-extents, protocol and sndbuf-size parameters were changed.
2 - DRBD was taken down and brought back up on both sides (to make sure the changes took effect).
3 - A copy of a 300 MB file onto the /dev/drbd2 partition was started.
4 - The copy runs all the way through.
5 - About 30 seconds to 1 minute after the copy finishes, we lose access to the /dev/drbd2 partition (whether through Samba from Windows or just doing an ls on the partition). All the other DRBD partitions and the system itself show no degradation.
6 - In cat /proc/drbd we see the bytes for this partition going from primary to secondary.
7 - Once the copy from primary to secondary is done, the /dev/drbd2 partition becomes available again and performance returns to normal on this partition (no other part of the Linux system is affected by this).

So here is how I read all this: it looks like DRBD doesn't really do its copy from primary to secondary in the background. My impression was that DRBD would complete its copy in the background without slowing down access to the filesystem on the primary machine. I really hope this is not a conceptual issue. If I copy, say, a 500 MB file, the same thing happens, except it happens even before the copy to the primary finishes, and it can even abort the copy.

<snip>

To make all partitions hang, I just copy 3 files, one into each partition, and I can hang all the DRBD partitions at once. They only become available again when the copy from primary to secondary completes.

<snip>
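(Side note, not part of the customer's reply: on 0.7 a full down/up shouldn't be needed just to pick up drbd.conf changes -- drbdadm adjust re-applies the settings to a running resource -- and watching /proc/drbd during the test copy shows which counters pile up while the partition is unreachable. A rough sketch only; double-check against the 0.7.19 man pages:)

    # re-read drbd.conf and apply any changed settings to the running resource
    drbdadm adjust drbd2

    # during the test copy: ap: (application) and pe: (pending) counts climbing
    # while the copy drains to the secondary would match the hang described above
    cat /proc/drbd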
Below is the drbd.conf present on both machines:

global { minor-count 4; dialog-refresh 600; }

resource drbd0 {
  protocol A;
  incon-degr-cmd "/bin/true";
  syncer { rate 70M; al-extents 257; }
  net    { sndbuf-size 512k; }
  on smb-mtl01 {
    device    /dev/drbd0;
    disk      /dev/mapper/VGPrivate_orion-LVPrivate_orion;
    address   10.10.1.3:7789;
    meta-disk /dev/sdh1[0];
  }
  on smb-sjo01 {
    device    /dev/drbd0;
    disk      /dev/mapper/VGPrivate_orion-LVPrivate_orion;
    address   10.10.2.66:7789;
    meta-disk /dev/sde1[0];
  }
}

resource drbd1 {
  protocol A;
  incon-degr-cmd "/bin/true";
  syncer { rate 70M; al-extents 257; }
  net    { sndbuf-size 512k; }
  on smb-mtl01 {
    device    /dev/drbd1;
    disk      /dev/mapper/VGProfiles_orion-LVProfiles_orion;
    address   10.10.1.3:7790;
    meta-disk /dev/sdh1[1];
  }
  on smb-sjo01 {
    device    /dev/drbd1;
    disk      /dev/mapper/VGProfiles_orion-LVProfiles_orion;
    address   10.10.2.66:7790;
    meta-disk /dev/sde1[1];
  }
}

resource drbd2 {
  protocol A;
  incon-degr-cmd "/bin/true";
  syncer { rate 70M; al-extents 257; }
  net    { sndbuf-size 512k; }
  on smb-mtl01 {
    device    /dev/drbd2;
    disk      /dev/mapper/VGPublic_orion-LVPublic_orion;
    address   10.10.1.3:7791;
    meta-disk /dev/sdh1[2];
  }
  on smb-sjo01 {
    device    /dev/drbd2;
    disk      /dev/mapper/VGPublic_orion-LVPublic_orion;
    address   10.10.2.66:7791;
    meta-disk /dev/sde1[2];
  }
}

-----------------------------------------------------------
End customer response.

Hope this sheds some light for somebody. It looks like they were also playing with dialog-refresh to see if that might help, but everything else looks o.k. to me.

Thanks,
Tim

Tim Johnson
Senior Software Engineer
Vision Solutions, Inc.

17911 Von Karman Ave, 5th Floor
Irvine, CA 92614
UNITED STATES

Tel: +1 (949) 253-6528
Fax: +1 (949) 225-0287
Email: tjohnson at visionsolutions.com
<http://www.visionsolutions.com/>

Disclaimer - 6/22/2006
The contents of this e-mail (and any attachments) are confidential, may be privileged, and may contain copyright material of Vision Solutions, Inc. or third parties. You may only reproduce or distribute the material if you are expressly authorized by Vision Solutions to do so. If you are not the intended recipient, any use, disclosure or copying of this e-mail (and any attachments) is unauthorized. If you have received this e-mail in error, please immediately delete it and any copies of it from your system and notify us via e-mail at helpdesk at visionsolutions.com
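(For reference -- my sketch, not the customer's config: the other knobs mentioned further down in this thread, max-buffers and max-epoch-size, also belong in the net section alongside sndbuf-size. The values below are purely illustrative, roughly the 0.7 defaults as I recall them, not a recommendation for this site.)

    net {
      sndbuf-size    512k;
      max-buffers    2048;   # buffers drbd may allocate for incoming data (matters on the receiving side)
      max-epoch-size 2048;   # highest number of write requests between two write barriers
    }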
-----Original Message-----
From: drbd-user-bounces at lists.linbit.com [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of jeffb
Sent: Wednesday, June 21, 2006 11:38 AM
To: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] Apologies... wrong subject: should have been drbd performance issue..

I think you can also mess with your al-extents setting to increase performance, but I think it only helps you get through bursts of size [al-extents * 4M], and it has other repercussions later as well -- you'll resync that much after a primary crash (see the quick back-of-envelope below). I think this mostly helps on systems with sub-optimal disks or controller cards. We had to set it high with our 3ware cards, but with our Arecas we've been just fine.

The problem with our 3ware cards was that when we had transfers > al-extents * 4M, the system would be fine until it had transferred that amount, or a little bit more, but then it would nearly deadlock once that limit was reached -- the sort of thing that could keep people from logging in or keeping their active connections. Our system would go into disk I/O deadlock for about 30 seconds to a minute, come out of it for about 10 seconds, then go right back into another deadlock. This would continue until shortly after our large transfers were done. Any transfer smaller than al-extents * 4M never had this problem (unless it was run too close to another large transfer).

I'm 100% sure that our problem was down to either our disks or our RAID controller (3ware 85xx), and not to drbd. The 3ware cards seemed to be OK if you only had a couple or a few disks, but as you kept adding more drives the problem would get progressively worse when running in a RAID 5 configuration. We had either 8 or 12 drives on our system and it was ugly and painful.
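(A quick back-of-envelope on the sizes jeffb mentions -- my numbers, not measured on either system, assuming the 4 MB-per-extent figure above:)

    # coverage of the activity log vs. the size of a single transfer
    al_extents=257
    transfer_mb=500
    echo "AL hot area : ~$(( al_extents * 4 )) MB"              # 257 * 4 MB ~= 1 GB
    echo "500 MB copy : ~$(( (transfer_mb + 3) / 4 )) extents"  # 125 extents, well under 257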
On Wed, 2006-06-21 at 10:47 -0700, Tim Johnson wrote:
> Tim Johnson
> Senior Software Engineer
> Vision Solutions, Inc.
>
> -----Original Message-----
> From: drbd-user-bounces at lists.linbit.com
> [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Tim Johnson
> Sent: Wednesday, June 21, 2006 10:47 AM
> To: drbd-user at lists.linbit.com
> Subject: RE: [DRBD-user] drbd and lvm understanding question
>
> Hi guys,
>
> We've got a bit of a problem at a customer site and I was wondering if
> anybody had any suggestions. With drbd up and running on both the
> primary and the backup, massive amounts of data were copied over to the
> relevant mount points on the primary. Apparently this slowed the
> machine down so much that users were getting kicked off (I haven't yet
> been able to find out from them whether it was CPU usage or memory).
> When drbd was taken down on the backup, everything was o.k. They
> started with drbd version 0.7.13. I perused the archives of this
> mailing list and found something which suggested that this was a
> problem fixed after 0.7.13, so they upgraded to 0.7.19 and are still
> having the problem.
>
> Parameters we've thought might be appropriate in the drbd.conf file
> are protocol (using protocol A; I'm sure this is fine), sndbuf-size
> (there are warnings about using large values like 1M), max-buffers
> (this looks promising to me), max-epoch-size, and maybe rate. I'm a
> bit nervous about changing anything, so does anybody have some good
> ideas?
>
> Appropriate environmental information such as /proc/drbd and system
> info is below.
>
> Thanks,
> Tim
>
> -----------------------------------------------------------------
> Output from /proc/drbd:
>
> smb-mtl01:~ # cat /proc/drbd
> version: 0.7.19 (api:78/proto:74)
> SVN Revision: 2212 build by root at smb-mtl01, 2006-06-13 10:56:44
>  0: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:0 nr:0 dw:2112 dr:209304 al:522 bm:0 lo:0 pe:0 ua:0 ap:0
>  1: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:32 nr:0 dw:352 dr:86220 al:3 bm:2 lo:0 pe:0 ua:0 ap:0
>  2: cs:Connected st:Primary/Secondary ld:Consistent
>     ns:32 nr:0 dw:1632 dr:123328 al:393 bm:3 lo:0 pe:0 ua:0 ap:0
>
> smb-sjo01:~ # cat /proc/drbd
> version: 0.7.19 (api:78/proto:74)
> SVN Revision: 2212 build by root at smb-sjo01, 2006-06-12 14:55:54
>  0: cs:Connected st:Secondary/Primary ld:Consistent
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>  1: cs:Connected st:Secondary/Primary ld:Consistent
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
>  2: cs:Connected st:Secondary/Primary ld:Consistent
>     ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
> ------------------------------------------------------------------
>
> Machine 1 (primary, I believe):
>   Link speed: 10 MBps (T1 line)
>   1 NIC, 100 MB/s
>   SuSE 9
>   Hardware: manufactured by IBM (iSeries), running on top of an AS/400 810
>   1 CPU at 1 GHz
>   1024 MB memory
>   200 GB disk
>
> Machine 2:
>   Also SuSE 9
>   Same IBM hardware, running on iSeries
>   1 CPU at 1 GHz
>   512 MB memory
>   150 GB disk space
>   1 NIC at 10 MB/s
>
> Replicating about 100 GB of data.
>
> Tim Johnson
> Senior Software Engineer
> Vision Solutions, Inc.

_______________________________________________
drbd-user mailing list
drbd-user at lists.linbit.com
http://lists.linbit.com/mailman/listinfo/drbd-user