[DRBD-user] DRBD serious locking due to TOE - UPDATE

Ben Clewett ben at roadrunner.uk.com
Fri Jan 4 11:23:18 CET 2008



Dear DRBD

Ref:- DRBD locking, suspect Broadcom NIC

For those of you experiencing lock-ups and following this thread, I have 
found another problem related to our HP ProLiant servers.

These optionally come with a battery-backed write-ahead cache on the 
cciss controller.  I have 256 MB of write-ahead cache active.

I found this thread from Lars suggesting these controllers may have a 
problem where they misreport that data has been stored:

http://lists.linbit.com/pipermail/drbd-user/2007-June/007033.html

I have set my DRBD parameters to match this posting.  So far no locking 
has occurred.  But sometimes it takes a few days before a lock occurs.

Therefore my locking problem seems to be related to three factors:
	NIC Firmware
	NIC driver (bnx2)
	cciss write-ahead-cache module.

I can't really give many figures as I am running live servers which I 
can't take offline to test.  The only figure I can consistently measure 
is the 'iowait' reported by sar.  When I have a lock this reports ~30%, 
and ~1% when not locking.

The iowait is listed as 'Percentage of time that the CPU or CPUs were 
idle during which the system had an outstanding disk I/O request.'
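
For reference, here is a minimal sketch of how a percentage like this can 
be derived from the kernel's CPU counters.  It assumes a Linux 
/proc/stat whose aggregate "cpu" line lists the counters in the order 
user, nice, system, idle, iowait, irq, softirq, steal.  The function 
names are hypothetical, and this is not necessarily how sar itself 
computes the figure:

```python
def parse_cpu_line(line):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples.

    Assumed field order: user, nice, system, idle, iowait, irq, softirq,
    steal. Any later fields (guest time) are ignored here.
    """
    first = parse_cpu_line(before)[:8]
    second = parse_cpu_line(after)[:8]
    deltas = [new - old for old, new in zip(first, second)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait counter


def read_cpu_line(path="/proc/stat"):
    """Read the aggregate 'cpu' line (the first line) from a stat file."""
    with open(path) as handle:
        return handle.readline()
```

Calling read_cpu_line() twice, a few seconds apart, and passing both 
results to iowait_percent() gives a figure for that interval.  It is 
still a system-wide number and says nothing about which device the 
outstanding requests belong to.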

If any members know how I can break down the iowait into something more 
specific, I would be very interested in knowing :)

Regards,

Ben



Ben Clewett wrote:
> 
> 
> Reg: Broadcom NICs causing DRBD to lock.
> 
> Hi Jure,
> 
> I wonder if you have made any further progress identifying this issue?
> 
> I thought the problem was solved by upgrading the firmware (to BC = 
> 1.9.6, iSCSI = 1.1.8, UMP = 1.1.8) and bnx2 driver (to 1.6.7b).  This 
> made a significant difference.
> 
> Until yesterday, when I had a bad lock.  After significant writing 
> through MySQL, both sides of my symmetric DRBD array went solid.  Load 
> stuck at about 2.0, which could not be accounted for by user processes, 
> and throughput and replication speed dropped to about a tenth of their 
> normal level.  Usual solution:  stop the writing processes, wait for 
> the lock to clear, then start again.
> 
> I tried:
> 
> # ethtool -K eth2 tso off
> 
> This had no effect.
> 
> I therefore have to conclude that there is still a problem somewhere, 
> although not as bad as with older firmware and bnx2 drivers.
> 
> Ben
> 
> 
> 
> 
> Jure Pečar wrote:
>> Coming late in this thread, I too (just recently) identified problems
>> with drbd on HP DL385 machines with Broadcom NICs.
>>
>> The problem is easily repeatable even on a minimal install: just enable
>> xinetd chargen on one machine and pull data from it on the other
>> (something like nc <ip> 19 > /dev/null) and watch iptraf -d on that
>> interface on the sending machine. The IP checksum error counter goes up
>> at about 10% of the packet count.
>>
>> The thing one has to turn off with ethtool is not IP checksumming, but
>> TCP segmentation offload (ethtool -K eth1 tso off). This is what is
>> really causing problems. I confirmed it on the 1.4.43, 1.4.52d and
>> 1.5.11-rh versions of the driver. In January, when I come back to work,
>> I'll repeat the tests with 1.6.7 to see if it makes any difference.
>>
>> IMHO it's a shame that neither Broadcom nor HP/Dell found this problem
>> in their QA tests, and that they released such broken NICs to the market.
>>
>>
> 
> 
> *************************************************************************
> This e-mail is confidential and may be legally privileged. It is intended
> solely for the use of the individual(s) to whom it is addressed. Any
> content in this message is not necessarily a view or statement from Road
> Tech Computer Systems Limited but is that of the individual sender. If
> you are not the intended recipient, be advised that you have received
> this e-mail in error and that any use, dissemination, forwarding,
> printing, or copying of this e-mail is strictly prohibited. We use
> reasonable endeavours to virus scan all e-mails leaving the company but
> no warranty is given that this e-mail and any attachments are virus free.
> You should undertake your own virus checking. The right to monitor e-mail
> communications through our networks is reserved by us
> 
>  Road Tech Computer Systems Ltd. Shenley Hall, Rectory Lane, Shenley,
>  Radlett, Hertfordshire, WD7 9AN. - VAT Registration No GB 449 3582 17
>  Registered in England No: 02017435, Registered Address: Charter Court, 
>  Midland Road, Hemel Hempstead,  Hertfordshire, HP2 5GE. 
> *************************************************************************
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
> 

