Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

we have been using DRBD very successfully for some time now, mirroring several TB of data. (At this point let me thank you for this really great product :)). We now want to establish disaster redundancy for some of that data by sharing it between two different datacenters, preferably using DRBD.

I am currently testing a cluster whose nodes are separated by a few km (ping times around 1 ms), running drbd 8.0.6 with protocol A/B/C, successfully. To test larger distances with higher latencies, we ran drbd through TCP proxies on a third machine that forwards traffic between the real drbd hosts (for example ssh forwarders or other userspace relay utilities).

With protocol C this usually works very well, but as latency rises, write access gets slower and slower (up to a factor of 25 with a 10 ms delay per TCP packet through the proxy). To overcome this we tried protocol A, since having somewhat older data matters less to us than being synchronous. On a low-latency network (LAN), protocol A works without problems. On higher-latency setups the initial exchange (like the bitmap exchange) usually works, although establishing the connection seems to be harder; but while data is being transferred we get a lot of 'BrokenPipe' states (often after only a few bytes). The latter is usually only fixed by disconnecting drbd on one host, which starts the cycle all over again.

Sometimes we get

  [6214.280107] drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().

while usually drbd just constantly tries to reconnect:

  [6220.255205] drbd0: sock was shut down by peer
  [6220.255217] drbd0: conn( WFReportParams -> BrokenPipe )
  [6220.255230] drbd0: short read expecting header on sock: r=0
  [6220.255236] drbd0: My msock connect got accepted onto peer's sock!
  [6226.251812] drbd0: tl_clear()
  [6226.251818] drbd0: Connection closed
  [6226.251829] drbd0: conn( BrokenPipe -> Unconnected )
  [6226.251841] drbd0: conn( Unconnected -> WFConnection )
  [6226.295766] drbd0: conn( WFConnection -> WFReportParams )
  [6226.296156] drbd0: sock was shut down by peer
  [6226.296164] drbd0: conn( WFReportParams -> BrokenPipe )
  [6226.296172] drbd0: short read expecting header on sock: r=0
  [6226.296178] drbd0: My msock connect got accepted onto peer's sock!
  [6232.288608] drbd0: tl_clear()

  (( repeating until drbdadm disconnect r0 ))

I was wondering whether someone has experience running DRBD over long distances or high-latency links, with ping times up to tens of ms, possibly with protocol A? Any hints or tips for tuning such setups are welcome. For reference, I have appended a rough sketch of our proxy test setup and of the resource configuration below.

Jens

--
jens.beyer at 1und1.de
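The proxy setup looks roughly like the following. Host names, port numbers and the interface are placeholders, and the netem delay is just one way of modelling the extra latency of a long-distance link, not part of the real path:

  # on the relay host: userspace forwarder between the two drbd nodes
  # (node-a's drbd.conf points at relay:7788 instead of node-b:7788)
  socat TCP-LISTEN:7788,fork,reuseaddr TCP:node-b.example.com:7788 &

  # add 10 ms of artificial delay on the relay's outgoing interface
  tc qdisc add dev eth0 root netem delay 10ms

  # alternatively, an ssh forwarder started on node-a through the relay
  ssh -f -N -L 7788:node-b.example.com:7788 user@relay.example.com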
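And this is essentially the resource configuration we are testing with; device names, addresses and all tuning values are illustrative, not recommendations:

  resource r0 {
    protocol A;            # asynchronous: a write completes once it is
                           # on local disk and in the TCP send buffer

    net {
      sndbuf-size  512k;   # larger send buffer for the high-latency path
      timeout      60;     # 6 seconds (unit is 0.1 s)
      connect-int  10;     # seconds between connect attempts
      ping-int     10;     # seconds between keep-alive pings
      ko-count     0;      # never kick out a peer just for being slow
    }

    syncer {
      rate 5M;             # cap resync bandwidth below the link capacity
    }

    on node-a {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on node-b {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }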