Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Dear Lars,

I found a server in lock when I got to my desk this morning. Not wanting to waste any time, these are the numbers you asked for. The lock is on 'hp-tm-02'; its twin 'hp-tm-04' is partially locked.

I use the term 'lock' to describe a server with high load and very much reduced throughput. This is best shown with an xload plot. The plot enclosed shows the servers side by side: the second is in lock (hp-tm-02), the first is its twin in partial lock (hp-tm-04). A normal xload plot would not show any load above 1.0 on either server.

Sar on hp-tm-02 (locked). Note the 'iowait', which is the best indicator of a 'lock':

# sar -u 2 10
Linux 2.6.18.8-0.7-default (hp-tm-02)    01/08/08

08:22:26        CPU     %user     %nice   %system   %iowait     %idle
08:22:28        all      0.75      0.00      0.50     24.84     73.91
08:22:30        all      2.50      0.00      0.88     44.75     51.88
08:22:32        all      0.25      0.00      0.25     40.02     59.48
08:22:34        all      0.75      0.00      0.00     25.06     74.19
08:22:36        all      0.50      0.00      0.37     25.34     73.78
08:22:38        all      3.50      0.00      1.00     36.33     59.18
08:22:40        all      7.60      0.00      1.25     28.64     62.52
08:22:42        all      1.87      0.00      0.50     25.94     71.70
08:22:44        all      0.62      0.00      1.62     42.09     55.67
08:22:46        all      0.50      0.00      0.50     49.94     49.06
Average:        all      1.88      0.00      0.69     34.29     63.14

Sar on hp-tm-04 (partially locked):

# sar -u 2 10
Linux 2.6.18.8-0.7-default (hp-tm-04)    01/08/08

08:22:37        CPU     %user     %nice   %system   %iowait     %idle
08:22:39        all     24.53      0.00      1.87      3.99     69.61
08:22:41        all     17.31      0.00      1.25      8.91     72.52
08:22:43        all      6.52      0.00      1.38     30.20     61.90
08:22:45        all      4.87      0.00      2.12     20.97     72.03
08:22:47        all      3.87      0.00      0.25     21.45     74.44
08:22:49        all      3.12      0.00      0.50     22.07     74.31
08:22:51        all      0.50      0.00      0.12     24.84     74.53
08:22:53        all      0.62      0.00      0.37     24.94     74.06
08:22:55        all      5.98      0.00      0.75     25.78     67.50
08:22:57        all      8.74      0.00      0.12     16.98     74.16
Average:        all      7.60      0.00      0.87     20.01     71.51

> try to avoid line breaks :)

Very sorry, that was my email client. I have adjusted it to see if this helps.

hp-tm-02: (lock)

version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-02, 2007-12-19 22:25:46
 0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
    ns:0 nr:91896348 dw:91896340 dr:0 al:0 bm:0 lo:2 pe:0 ua:1 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
 1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
    ns:125994544 nr:0 dw:125994540 dr:57151581 al:477198 bm:0 lo:2 pe:0 ua:0 ap:2
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:2/257 hits:31021438 misses:483228 starving:102 dirty:5998 changed:477198

> and the other node, during "lock"?
> (find information always on both nodes)

hp-tm-04: (partial lock)

version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-04, 2007-12-19 22:26:42
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
    ns:91909804 nr:0 dw:91909804 dr:29581641 al:477780 bm:0 lo:0 pe:2 ua:0 ap:1
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:1/257 hits:22499671 misses:478376 starving:0 dirty:596 changed:477780
 1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
    ns:0 nr:125997044 dw:125997044 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
        resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
        act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0

> the "lo:NNN", on the Secondary, does that change?
> I mean, does this gauge change,
> and eventually decrease to zero, during "lock"?

When I found it, it was at zero.
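If it would help, something along these lines could record that gauge every couple of seconds on both nodes during the next lock (a rough sketch only; the interval and the log file path are arbitrary choices):

# Log a timestamp plus every lo: field from /proc/drbd every 2 seconds.
# Sketch only: interval and log path are arbitrary, adjust as needed.
while true; do
    echo "$(date +%H:%M:%S) $(grep -o 'lo:[0-9]*' /proc/drbd | tr '\n' ' ')" >> /tmp/drbd-lo.log
    sleep 2
done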
> do both drbd live on the same physical device?
> (physical meaning the same io-queue in linux, i.e.
> the same /dev/sdX eventually,
> when propagating down all the lvm layers, if any)

Both DRBD resource partitions and both DRBD bitmaps live on the same device: /dev/cciss, split into four partitions. This is a hardware RAID 5 device built from seven + one physical SAS disks. It has 256MB of write-ahead cache (with battery) and a tested write rate (bonnie++) of about 250MB/sec. I do not use LVM, if you mean the IBM piece of dynamic partitioning software ported onto Linux.

> how many cpus?

Four physical = eight logical, on twin dual-core Xeons.

> how many pdflush threads (ps ax | grep pdflush)
> during "lock", are one or more of those in "D" state?
> if so, does it stay in "D" state?

On hp-tm-02 (locked):

# ps axl | grep pdflush
1     0   196    15  15   0     0    0 pdflus S    ?          0:15 [pdflush]
1     0   197    15  15   0     0    0 sync_b D    ?          1:34 [pdflush]

Pid 197 seems to move between S and D, but is mostly S.

> during lock, do (on both nodes)
> ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
> that should give some more information.

hp-tm-02: (locked)

 3788 S -                                        drbd0_worker
 3796 D drbd_md_sync_page_io                     drbd1_worker
 3817 S -                                        drbd0_receiver
 3825 S -                                        drbd1_receiver
 4394 S -                                        drbd0_asender
 4395 S -                                        drbd1_asender
 2996 D sync_buffer                              find

And:

  197 D sync_buffer                              pdflush
  959 D -                                        reiserfs/3
 3788 S -                                        drbd0_worker
 3796 S -                                        drbd1_worker
 3817 D drbd_wait_ee_list_empty                  drbd0_receiver
 3825 S -                                        drbd1_receiver
 4394 S -                                        drbd0_asender
 4395 S -                                        drbd1_asender

But mostly:

 3788 S -                                        drbd0_worker
 3796 S -                                        drbd1_worker
 3817 S -                                        drbd0_receiver
 3825 S -                                        drbd1_receiver
 4394 S -                                        drbd0_asender
 4395 S -                                        drbd1_asender

hp-tm-04: (partially locked)

14188 S -                                        drbd0_worker
14194 S -                                        drbd1_worker
14214 S -                                        drbd0_receiver
14216 S -                                        drbd0_asender
14223 S -                                        drbd1_receiver
14225 S -                                        drbd1_asender

> during lock, does it help if you
> drbdadm disconnect $resource ; sleep 3; drbdadm adjust $resource
> (on one or the other node)

I am sorry, I can't disconnect these resources.

> how frequently do you run into these locks?

It depends on loading; I don't have much quantitative data. It seems to hit after about four days of runtime, at least the last few times. Once the locking has started it will continue until some time after the loading stops, say 30 minutes. But once hit, it will return frequently at lower load, and will continue on and off until I either (i) restart DRBD or (ii) restart the server. I am not sure at this stage which.

> during lock,
> what does "netstat -tnp" say (always on both nodes)?
> (preferably grep for the drbd connection,
> so something like
> netstat -tnp | grep ':778[89] '
> if your drbd ports are configured to be 7788 and 7789.

hp-tm-02:

tcp    0    0 192.168.95.5:7788     192.168.95.6:45579    ESTABLISHED -
tcp    0    0 192.168.95.5:7789     192.168.95.6:50365    ESTABLISHED -
tcp    0    0 192.168.95.5:51501    192.168.95.6:7789     ESTABLISHED -
tcp    0    0 192.168.95.5:54029    192.168.95.6:7788     ESTABLISHED -

hp-tm-04:

tcp    0    0 192.168.95.6:7788     192.168.95.5:54029    ESTABLISHED -
tcp    0    0 192.168.95.6:7789     192.168.95.5:51501    ESTABLISHED -
tcp    0    0 192.168.95.6:50365    192.168.95.5:7789     ESTABLISHED -
tcp    0    0 192.168.95.6:45579    192.168.95.5:7788     ESTABLISHED -
tcp    0    0 192.168.95.6:45579    192.168.95.5:7788     ESTABLISHED -

Now that's odd: why should there be five? But a repeat of the test shows just four entries.

# lsof -t :45579
# lsof -t :7788

(No return from either.)

I hope this is enough to be going on with. Please let me know if there is any more information I can supply.
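PS. To make the next capture less manual, a small watcher along these lines could snapshot the diagnostics you asked for whenever iowait climbs. This is a sketch only: the threshold, sample interval, ports and output directory are guesses that would need adjusting.

#!/bin/bash
# Sketch only: snapshot DRBD diagnostics whenever iowait passes a threshold.
# The threshold, sample interval, ports and output directory are assumptions.
THRESHOLD=25
OUT=/var/log/drbd-lock
mkdir -p "$OUT"
while true; do
    # %iowait is the 6th column of sar's Average line on these systems.
    IOWAIT=$(sar -u 5 1 | awk '/^Average/ {print int($6)}')
    if [ "${IOWAIT:-0}" -ge "$THRESHOLD" ]; then
        STAMP=$(date +%Y%m%d-%H%M%S)
        {
            cat /proc/drbd
            ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
            netstat -tnp | grep ':778[89] '
        } > "$OUT/lock-$STAMP.txt" 2>&1
    fi
    sleep 10
done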
Lars Ellenberg wrote:
> On Mon, Jan 07, 2008 at 03:16:04PM +0000, Ben Clewett wrote:
>>
>> Hi Lars,
>>
>> Thanks for taking an interest, I hope you find something. I will give
>> you all the information I have:
>>
>> Kernel: 2.6.18.0
>> DRBD: 8.2.1
>>
>> /proc/drbd before lock:
>>
>> version: 8.2.1 (api:86/proto:86-87)
>> GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-02, 2007-12-10 22:21:14
>>  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
>>     ns:1053120 nr:28862256 dw:28864144 dr:1065701 al:46 bm:392 lo:1 pe:0 ua:1 ap:0
>>         resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
>>         act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
>>  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
>>     ns:41783720 nr:1053224 dw:42836948 dr:7104597 al:144980 bm:422 lo:0 pe:0 ua:0 ap:0
>>         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
>>         act_log: used:1/257 hits:10300951 misses:145315 starving:0 dirty:335 changed:144980
>>
>> /proc/drbd during lock:
>>
>> GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root@hp-tm-02, 2007-12-10 22:21:14
>>  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
>>     ns:1053120 nr:78899808 dw:78901652 dr:1065701 al:46 bm:392 lo:12 pe:0 ua:1 ap:0
>>         resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
>>         act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
>>  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
>>     ns:169235760 nr:1053224 dw:170288980 dr:26714233 al:655079 bm:422 lo:2 pe:0 ua:0 ap:2
>>         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
>>         act_log: used:3/257 hits:41653862 misses:656540 starving:0 dirty:1461 changed:655079
>>
>> ** No kernel messages. **
>
> try to avoid line breaks :)
>
> and the other node, during "lock"?
> (find information always on both nodes)
>
> the "lo:NNN", on the Secondary, does that change?
> I mean, does this gauge change,
> and eventually decrease to zero, during "lock"?
>
> do both drbd live on the same physical device?
> (physical meaning the same io-queue in linux, i.e.
> the same /dev/sdX eventually,
> when propagating down all the lvm layers, if any)
>
> how many cpus?
> how many pdflush threads (ps ax | grep pdflush)
> during "lock", are one or more of those in "D" state?
> if so, does it stay in "D" state?
>
>> It is unfortunate but I cannot take my servers off line for good testing.
>>
>> The only figure I can consistently measure is the 'iowait' reported by
>> 'sar'. When I have a lock this will report ~ 30%, and ~ 1% when not
>> locking.
>>
>> The iowait is listed in the 'sar' man page as 'Percentage of time that
>> the CPU or CPUs were idle during which the system had an outstanding
>> disk I/O request.' I guess this is a wait from the /dev/drbd0 device?
>>
>> If you know how I could break down 'iowait' I would be interested.
>
> during lock, do (on both nodes)
> ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
> that should give some more information.
>
>> During the lock the TCP IO drops considerably. This is the output from
>> iftop before and during a lock:
>>
>> Before:
>> 192.168.95.5 => 192.168.95.6    107Mb   80.0Mb   76.2Mb
>>              <=                4.98Mb   4.65Mb   4.61Mb
>>
>> During:
>> 192.168.95.5 => 192.168.95.6    214Kb   1.30Mb    963Kb
>>              <=                8.74Mb   20.6Mb   10.9Mb
>
> you can further break down that to port numbers.
> (press capital D and S)
> if you tcpdump limiting to the drbd port(s),
> what is still transmitted?
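Next time a lock hits, a capture limited to the DRBD ports could be taken with something like this (ports 7788/7789 match the netstat output above; '-i any' avoids guessing the interface name):

# Sketch: capture only the DRBD replication traffic during a lock.
tcpdump -ni any -s 0 -w /tmp/drbd-lock.pcap 'tcp port 7788 or tcp port 7789'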
>
> during lock, does it help if you
> drbdadm disconnect $resource ; sleep 3; drbdadm adjust $resource
> (on one or the other node)
>
> how frequently do you run into these locks?
>
> during lock,
> what does "netstat -tnp" say (always on both nodes)?
> (preferably grep for the drbd connection,
> so something like
> netstat -tnp | grep ':778[89] '
> if your drbd ports are configured to be 7788 and 7789.
>
>> I hope there is something useful in this.
>>
>> Ben
>>
>> PS, is there a way of working DRBD using UDP? This would also fix this
>> specific problem with the Broadcom network cards, where the problem is
>> documented as specific to TCP.
>
> No.
> And don't expect that too soon. for write ordering we would need to
> reimplement the packet ordering logic of tcp for the udp connection.
> why should we do that?
>
> that said, maybe you could do openvpn over udp,
> then direct drbd to use some ip within that openvpn net,
> which then does a tcp connection to the tun device of openvpn,
> which encrypts/compresses/authenticates/whatever that packet,
> and will send it as udp over to the other node. happy hacking :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: xload.png
Type: image/png
Size: 8663 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080108/51a19455/attachment.png>