[DRBD-user] DRBD serious locking due to TOE - UPDATE

Ben Clewett ben at roadrunner.uk.com
Tue Jan 8 10:06:44 CET 2008

Dear Lars,

I found a server in lock when I got to my desk this morning.  Not wanting to
waste any time, these are the numbers you asked for.

The lock is on 'hp-tm-02'; its twin, 'hp-tm-04', is partially locked.

I use the term 'lock' to describe a server with high load and greatly
reduced throughput.

This is best shown with an xload plot.  The enclosed plot shows the two
servers side by side: the second (hp-tm-02) is in lock, the first
(hp-tm-04) is its twin in partial lock.  A normal xload plot would not
show any load above 1.0 on either server.
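
In case it is useful, a sketch of how the load figures could be logged
for correlation with these locks (the log path is arbitrary):

# log the 1/5/15-minute load averages once a minute
while true; do
    echo "$(date '+%F %T') $(cut -d' ' -f1-3 /proc/loadavg)"
    sleep 60
done >> /var/log/loadavg.log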

Sar on hp-tm-02: (locked)  Note the '%iowait' column, which is the best
indicator of a 'lock'.

# sar -u 2 10
Linux 2.6.18.8-0.7-default (hp-tm-02)   01/08/08

08:22:26          CPU     %user     %nice   %system   %iowait     %idle
08:22:28          all      0.75      0.00      0.50     24.84     73.91
08:22:30          all      2.50      0.00      0.88     44.75     51.88
08:22:32          all      0.25      0.00      0.25     40.02     59.48
08:22:34          all      0.75      0.00      0.00     25.06     74.19
08:22:36          all      0.50      0.00      0.37     25.34     73.78
08:22:38          all      3.50      0.00      1.00     36.33     59.18
08:22:40          all      7.60      0.00      1.25     28.64     62.52
08:22:42          all      1.87      0.00      0.50     25.94     71.70
08:22:44          all      0.62      0.00      1.62     42.09     55.67
08:22:46          all      0.50      0.00      0.50     49.94     49.06
Average:          all      1.88      0.00      0.69     34.29     63.14

Sar on hp-tm-04: (partially locked)

# sar -u 2 10
Linux 2.6.18.8-0.7-default (hp-tm-04)   01/08/08

08:22:37          CPU     %user     %nice   %system   %iowait     %idle
08:22:39          all     24.53      0.00      1.87      3.99     69.61
08:22:41          all     17.31      0.00      1.25      8.91     72.52
08:22:43          all      6.52      0.00      1.38     30.20     61.90
08:22:45          all      4.87      0.00      2.12     20.97     72.03
08:22:47          all      3.87      0.00      0.25     21.45     74.44
08:22:49          all      3.12      0.00      0.50     22.07     74.31
08:22:51          all      0.50      0.00      0.12     24.84     74.53
08:22:53          all      0.62      0.00      0.37     24.94     74.06
08:22:55          all      5.98      0.00      0.75     25.78     67.50
08:22:57          all      8.74      0.00      0.12     16.98     74.16
Average:          all      7.60      0.00      0.87     20.01     71.51
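
If it helps, I can leave a small watchdog running so the next lock is
timed exactly.  A sketch, assuming the default sar column order shown
above (iowait is the sixth field of the Average line):

# append a timestamped iowait sample roughly every ten seconds
while true; do
    echo -n "$(date '+%F %T') "
    sar -u 1 1 | awk '/^Average/ {print "iowait=" $6}'
    sleep 10
done >> /var/log/iowait.log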

 > try to avoid line breaks :)

Very sorry, that was my email client.  I have adjusted it to see if
this helps.

hp-tm-02:  (locked)

version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root at hp-tm-02, 2007-12-19 22:25:46
  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
     ns:0 nr:91896348 dw:91896340 dr:0 al:0 bm:0 lo:2 pe:0 ua:1 ap:0
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
     ns:125994544 nr:0 dw:125994540 dr:57151581 al:477198 bm:0 lo:2 pe:0 ua:0 ap:2
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:2/257 hits:31021438 misses:483228 starving:102 dirty:5998 changed:477198

 > and the other node, during "lock"?
 > (find information always on both nodes)

hp-tm-04: (partially locked)

version: 8.2.1 (api:86/proto:86-87)
GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by root at hp-tm-04, 2007-12-19 22:26:42
  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
     ns:91909804 nr:0 dw:91909804 dr:29581641 al:477780 bm:0 lo:0 pe:2 ua:0 ap:1
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:1/257 hits:22499671 misses:478376 starving:0 dirty:596 changed:477780
  1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
     ns:0 nr:125997044 dw:125997044 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
         resync: used:0/31 hits:0 misses:0 starving:0 dirty:0 changed:0
         act_log: used:0/257 hits:0 misses:0 starving:0 dirty:0 changed:0
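
For what it's worth, as I read the DRBD documentation these gauges mean:
lo = open requests to the local I/O subsystem, pe = requests sent to the
peer but not yet answered, ua = requests received from the peer but not
yet acknowledged, ap = application requests not yet answered by DRBD.
They can be pulled out on their own like this:

# extract just the request gauges from /proc/drbd
grep -Eo '(lo|pe|ua|ap):[0-9]+' /proc/drbd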


 > the "lo:NNN", on the Secondary, does that change?
 > I mean, does this gauge change,
 > and eventually decrease to zero, during "lock"?

When I found the lock it was at zero.
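
Next time I will watch it continuously rather than just at the moment I
find the lock, something like:

# sample the lo: gauge once a second
watch -n1 "grep -Eo 'lo:[0-9]+' /proc/drbd"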

 > do both drbd live on the same physical device?
 > (physical meaning the same io-queue in linux, i.e.
 > the same /dev/sdX eventually,
 > when propagating down all the lvm layers, if any)

Both DRBD resource partitions and both DRBD bitmaps live on the same
device: /dev/cciss, split into four partitions.  This is a hardware
RAID 5 device built from seven + one physical SAS disks.  It has 256MB
of battery-backed write cache and a tested write rate (bonnie++) of
about 250MB/sec.
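
For reference, that figure came from a bonnie++ run of roughly this
shape (the mount point and file size here are placeholders, not the
exact invocation):

# sequential-throughput benchmark on the RAID 5 array
bonnie++ -d /data -s 32g -u root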

I do not use LVM, if you mean the logical volume management layer on
Linux.
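
To confirm that all four partitions really do share one io-queue
(cciss naming assumed; sysfs replaces the '/' in the device name
with '!'):

# all partitions of controller 0, disk 0 sit on the one c0d0 queue
ls -l /dev/cciss/c0d0*
ls '/sys/block/cciss!c0d0/queue/'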

 > how many cpus?

Four physical cores = eight logical, on twin dual-core Xeons.

 > how many pdflush threads (ps ax | grep pdflush)
 > during "lock", are one or more of those in "D" state?
 > if so, does it stay in "D" state?

On hp-tm-02 (locked)

# ps axl | grep pdflush
1     0   196    15  15   0      0     0 pdflus S    ?          0:15 [pdflush]
1     0   197    15  15   0      0     0 sync_b D    ?          1:34 [pdflush]

Pid 197 seems to move between S and D, though mostly S.
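
To pin down how long pid 197 stays in D, I can sample the pdflush
threads for a while, along these lines:

# thirty one-second samples of the pdflush thread states
for i in $(seq 1 30); do
    ps -o pid,state,wchan:30,comm -C pdflush
    sleep 1
done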

 > during lock, do (on both nodes)
 >  ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
 > that should give some more information.

hp-tm-02:  (locked)

  3788 S -                                        drbd0_worker
  3796 D drbd_md_sync_page_io                     drbd1_worker
  3817 S -                                        drbd0_receiver
  3825 S -                                        drbd1_receiver
  4394 S -                                        drbd0_asender
  4395 S -                                        drbd1_asender
  2996 D sync_buffer                              find

And:

   197 D sync_buffer                              pdflush
   959 D -                                        reiserfs/3
  3788 S -                                        drbd0_worker
  3796 S -                                        drbd1_worker
  3817 D drbd_wait_ee_list_empty                  drbd0_receiver
  3825 S -                                        drbd1_receiver
  4394 S -                                        drbd0_asender
  4395 S -                                        drbd1_asender

But mostly:

  3788 S -                                        drbd0_worker
  3796 S -                                        drbd1_worker
  3817 S -                                        drbd0_receiver
  3825 S -                                        drbd1_receiver
  4394 S -                                        drbd0_asender
  4395 S -                                        drbd1_asender


hp-tm-04:  (partially locked)

14188 S -                                        drbd0_worker
14194 S -                                        drbd1_worker
14214 S -                                        drbd0_receiver
14216 S -                                        drbd0_asender
14223 S -                                        drbd1_receiver
14225 S -                                        drbd1_asender
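
Since the D states seem transient, for the next lock I will log
timestamped snapshots of your listing instead of one-off samples
(the log path is arbitrary):

# capture the suggested listing every two seconds with a timestamp
while true; do
    date '+%F %T'
    ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
    sleep 2
done >> /tmp/drbd-wchan.log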


 > during lock, does it help if you
 > drbdadm disconnect $resource ; sleep 3; drbdadm adjust $resource
 > (on one or the other node)

I am sorry I can't disconnect these resources.

 > how frequently do you run into these locks?

That depends on loading; I don't have much quantitative data.  It seems
to hit after about 4 days of runtime, at least the last few times.  Once
the locking has started it will continue until some time after the load
drops, say 30 minutes.  But once it has hit, it will return frequently
even at lower load, and will continue on and off until I either
(i) restart DRBD or (ii) restart the server.  I am not sure at this
stage which of the two clears it.


 > during lock,
 > what does "netstat -tnp" say (always on both nodes)?
 > (preferably grep for the drbd connection,
 >  so something like
 >  netstat -tnp | grep ':778[89] '
 >  if your drbd ports are configured to be 7788 and 7789.

hp-tm-02:

tcp        0      0 192.168.95.5:7788       192.168.95.6:45579      ESTABLISHED -
tcp        0      0 192.168.95.5:7789       192.168.95.6:50365      ESTABLISHED -
tcp        0      0 192.168.95.5:51501      192.168.95.6:7789       ESTABLISHED -
tcp        0      0 192.168.95.5:54029      192.168.95.6:7788       ESTABLISHED -

hp-tm-04:

tcp        0      0 192.168.95.6:7788       192.168.95.5:54029      ESTABLISHED -
tcp        0      0 192.168.95.6:7789       192.168.95.5:51501      ESTABLISHED -
tcp        0      0 192.168.95.6:50365      192.168.95.5:7789       ESTABLISHED -
tcp        0      0 192.168.95.6:45579      192.168.95.5:7788       ESTABLISHED -
tcp        0      0 192.168.95.6:45579      192.168.95.5:7788       ESTABLISHED -

Now that's odd, why should there be five?  But a repeat of the test
shows just four entries.

# lsof -t -i :45579
# lsof -t -i :7788

(No output.  I assume that is because the sockets belong to DRBD's
kernel threads rather than a userspace process, which would also explain
the '-' in netstat's PID/Program name column.)
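
For the next lock I can also capture what is still transmitted on the
DRBD ports, along the lines of your tcpdump suggestion (the interface
name here is a placeholder for whichever carries the 192.168.95.x link):

# first hundred packets on the drbd ports during a lock
tcpdump -ni eth1 -c 100 'tcp port 7788 or tcp port 7789'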

I hope this is enough to be going on with.  Please let me know if there
is any more information I can supply.







Lars Ellenberg wrote:
> On Mon, Jan 07, 2008 at 03:16:04PM +0000, Ben Clewett wrote:
>>
>> Hi Lars,
>>
>> Thanks for taking an interest, I hope you find something.  I will give 
>> you all the information I have:
>>
>> Kernel:	2.6.18.0
>> DRBD:	8.2.1
>>
>>
>> /proc/drbd before lock:
>>
>> version: 8.2.1 (api:86/proto:86-87)
>> GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by 
>> root at hp-tm-02, 2007-12-10 22:21:14
>>  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
>>     ns:1053120 nr:28862256 dw:28864144 dr:1065701 al:46 bm:392 lo:1 pe:0 ua:1 ap:0
>>         resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
>>         act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
>>  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
>>     ns:41783720 nr:1053224 dw:42836948 dr:7104597 al:144980 bm:422 lo:0 pe:0 ua:0 ap:0
>>         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
>>         act_log: used:1/257 hits:10300951 misses:145315 starving:0 dirty:335 changed:144980
>>
>> /proc/drbd during lock:
>>
>> GIT-hash: 318925802fc2638479ad090b73d7af45503dd184 build by 
>> root at hp-tm-02, 2007-12-10 22:21:14
>>  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate B r---
>>     ns:1053120 nr:78899808 dw:78901652 dr:1065701 al:46 bm:392 lo:12 pe:0 ua:1 ap:0
>>         resync: used:0/31 hits:65607 misses:217 starving:0 dirty:0 changed:217
>>         act_log: used:1/257 hits:427 misses:46 starving:0 dirty:0 changed:46
>>  1: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate B r---
>>     ns:169235760 nr:1053224 dw:170288980 dr:26714233 al:655079 bm:422 lo:2 pe:0 ua:0 ap:2
>>         resync: used:0/31 hits:65595 misses:213 starving:0 dirty:0 changed:213
>>         act_log: used:3/257 hits:41653862 misses:656540 starving:0 dirty:1461 changed:655079
>>
>> ** No kernel messages. **
> 
> try to avoid line breaks :)
> 
> and the other node, during "lock"?
> (find information always on both nodes)
> 
> the "lo:NNN", on the Secondary, does that change?
> I mean, does this gauge change,
> and eventually decrease to zero, during "lock"?
> 
> do both drbd live on the same physical device?
> (physical meaning the same io-queue in linux, i.e.
>  the same /dev/sdX eventually,
>  when propagating down all the lvm layers, if any)
> 
> how many cpus?
> how many pdflush threads (ps ax | grep pdflush)
> during "lock", are one or more of those in "D" state?
> if so, does it stay in "D" state?
> 
>> It is unfortunate but I cannot take my servers off line for good testing.
>>
>> The only figure I can consistent measure is the 'iowait' reported by 
>> 'sar'.   When I have a lock this will report ~ 30%, and ~ 1% when not 
>> locking.
>>
>> The iowait is listed in the 'sar' man page as 'Percentage of time that 
>> the CPU or CPUs were idle during which the system had an outstanding 
>> disk I/O request.'  I guess this is a wait from the /dev/drbd0 device?
>>
>> If you know how I could break down 'iowait' I would be interested.
> 
> during lock, do (on both nodes)
>  ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
> that should give some more information.
> 
>> During the lock the TCP IO drops considerably.  This is the output from 
>> iftop before and during a lock:
>>
>> Before:
>> 192.168.95.5          => 192.168.95.6           107Mb  80.0Mb  76.2Mb
>>                       <=                       4.98Mb  4.65Mb  4.61Mb
>>
>> During:
>> 192.168.95.5          => 192.168.95.6           214Kb  1.30Mb   963Kb
>>                       <=                       8.74Mb  20.6Mb  10.9Mb
> 
> you can further break down that to port numbers.
> (press capital D and S)
> if you tcpdump limiting to the drbd port(s),
> what is still transmitted?
> 
> during lock, does it help if you
>  drbdadm disconnect $resource ; sleep 3; drbdadm adjust $resource
> (on one or the other node)
> 
> how frequently do you run into these locks? 
> 
> during lock,
> what does "netstat -tnp" say (always on both nodes)?
> (preferably grep for the drbd connection,
>  so something like
>   netstat -tnp | grep ':778[89] '
>  if your drbd ports are configured to be 7788 and 7789.
> 
>> I hope there is something useful in this.
>>
>> Ben
>>
>> PS, is there a way of working DRBD using UDP?  This would also fix this 
>> specific problem with the Broadcom network cards, where the problem is 
>> documented as specific to TCP.
> 
> No.
> And don't expect that too soon.  for write ordering we would need to
> reimplement the packet ordering logic of tcp for the udp connection.
> why should we do that?
> 
> that said, maybe you could do openvpn over udp,
> then direct drbd to use some ip within that openvpn net,
> which then does a tcp connection to the tun device of openvpn,
> which encrypts/compresses/authenticates/whatever that packet,
> and will send it as udp over to the other node.  happy hacking :)
> 



-------------- next part --------------
A non-text attachment was scrubbed...
Name: xload.png
Type: image/png
Size: 8663 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20080108/51a19455/attachment.png>

