[DRBD-user] Re: drbd 0.7.21(kernel 2.6.17) hang problem

Wed Sep 20 18:22:12 CEST 2006

Lars Ellenberg wrote:
> / 2006-09-19 16:44:14 +0200
> \ Maciej Bogucki:
> 
>>Hello,
>>
>>I've been using drbd for the past few years without any problems, but now I have a strange one.
>>I have an HA mysql server with drbd 0.7.21 and kernel 2.6.17. When the mysql
>>partition (mysql databases) is on a drbd device (datadb), the server
>>lags - no response for 1-4 seconds. It isn't a network-related problem (there is no packet loss!!), since
>>the console (keyboard and monitor are directly connected) and all processes on the server hang too! The problem only
>>appears when the mysql database is on a drbd device, and everything works fine when I move the data to a non-drbd
>>device (sda). So I'm sure that it is a drbd or kernel problem. When I migrate mysql to the secondary machine, I have the
>>same problems, so I think that the hardware is ok  :)
>>The same problem occurs with drbd 0.7.17 and kernel version 2.6.14.
>>The strangest thing is that I have the same hardware and
>>software (drbd, kernel) in another location and there are no problems. The one
>>difference is that there I have apache instead of mysql.
>>Any ideas?
> 
> 
> outside drbd:
> verify what io scheduler you use.
> I'd recommend using "deadline" on servers.
I had the "cfq" scheduler, but I changed it to "deadline", and I still 
have lags :(
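
For anyone who wants to verify the same thing: a minimal sketch of reading the active scheduler from sysfs, assuming the kernel exposes /sys/block/<device>/queue/scheduler and marks the scheduler in use with square brackets. The device name "sda" and both function names are only illustrative. The check presumably matters for the backing disk under drbd, not for the drbd device itself.

```python
from pathlib import Path


def active_scheduler(sysfs_text: str) -> str:
    """Return the scheduler shown in [brackets], i.e. the one in use."""
    for name in sysfs_text.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError(f"no active scheduler marked in {sysfs_text!r}")


def read_scheduler(device: str = "sda") -> str:
    """Read the active scheduler for a block device.

    The default device name is an assumption; pass the disk
    that actually backs the drbd resource.
    """
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

Calling read_scheduler("sda") on the primary should return "deadline" once the change has taken effect. Writing the scheduler name to the same sysfs file as root is the usual way to switch at runtime.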

> 
> in drbd:
> you could play with "unplug-watermark" and "max-epoch-size" (and
> possibly max-buffers).
> when I say "play", I mean it.  it could get better if you increase,
> it could get better when you decrease, it could get better if you
> adjust in opposite directions (where possible), and it could happen to
> have no noticeable effect at all, which is all very dependent on your
> lower level io subsystem and on network timings and ...
I know that I can play with them, but there is another strange thing. 
When I disconnect the secondary node (shutting down heartbeat and drbd), I still get 
lags. Also, I don't have much traffic on the database (1256 writes per 
minute, so it's about 20KB per second, and only a few reads per minute), so 
playing with the "net" parameters should not be necessary in my case. I think that 
it is a drbd bug or some stupid thing :)
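
A rough check of that traffic estimate, as a sketch: 1256 writes per minute is about 21 writes per second, and the 20KB per second figure follows only if each write is about 1 KB. That write size is an assumption made here for the arithmetic, not a measured value.

```python
writes_per_minute = 1256  # figure quoted above
writes_per_second = writes_per_minute / 60

# Assumption: roughly 1 KB per write (not measured).
assumed_bytes_per_write = 1024
throughput_kb_per_s = writes_per_second * assumed_bytes_per_write / 1024

print(f"{writes_per_second:.1f} writes/s, "
      f"about {throughput_kb_per_s:.1f} KB/s at 1 KB per write")
```

Either way, the load is very small compared with what the disks and the replication link should handle.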

>>resource datafs {
>>  protocol C;
>>  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";
> 
> 
> I know the "halt -f" is in the example config, but you may want to
> consider writing something like "sleep <verylargenumber>" or
> "killall -9 heartbeat ccm ipfail" instead...
But if I do as you suggest, there is a higher chance of getting a split 
brain. With "halt -f" the chance is minimal.

Best Regards
Maciej Bogucki