[DRBD-user] drbd lockup over a wan

Lars Ellenberg Lars.Ellenberg at linbit.com
Tue Jan 23 21:23:07 CET 2007



please.
when you start a new thread,
do not reply to an existing one.
please.

/ 2007-01-23 12:41:55 -0600
\ Christopher Harrison:
> I have been experiencing a problem where my drbd disk partitions get
> stuck in a network failure mode.   I am noticing a number of these
> errors in my dmesg and the system becomes unresponsive (due to processes
> waiting to write to disk; I am also running protocol C).   The system
> eventually gets buried by processes waiting to write to disk and will
> eventually lock up.   Attached is my dmesg on the remote side before the
> primary system locks up completely.   I am theorizing that I
> may need to tweak my tcp settings to make the system behave more
> gracefully.   Has anyone else had a similar experience?   I am running
> over a 6Mbit bidirectional link with a vpn tunnel built between sites.
> 
> My setup is a xen host running kernel version
> 2.6.17-1.2174_FC5xenU
> 
> drbd version:
> drbd-0.7.21-1
> drbd km version:
> drbd-km-2.6.17_1.2174_FC5xenU-0.7.21-1
> 
> I have three drbd partitions which reside on lvm's:
>  version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by harrison at dragon, 2006-09-07 09:26:55
>  0: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:479044 nr:27495304 dw:40351000 dr:4944468 al:408 bm:839 lo:0 pe:0 ua:0 ap:0
>  1: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:654288 nr:44147520 dw:46264868 dr:2564263 al:123 bm:629 lo:1 pe:0 ua:0 ap:1
>  2: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:16 nr:234632 dw:234652 dr:423 al:1 bm:41 lo:0 pe:0 ua:0 ap:0
> 
> I have also included my drbd.conf.   I am at a loss as to how to prevent
> or mitigate the lockup situation.   We have been running this configuration
> for the past 5 months but only within the past month have I been
> experiencing this problem.   There have been no drbd config changes to
> these systems.

maybe some network components start failing ...
(I do software; I obviously first point at the hardware :->)

> I am also experiencing the same problem with another server running kernel:
>  2.6.18-1.2257.fc5xenU
> drbd version:
> drbd-0.7.22-1
> drbd-km-2.6.18_1.2257.fc5xenU-0.7.22-1

increase the timeouts in drbd.

>   net {
>         ko-count                180;

that's huge. that means about (180 * ping-int)
 -> half an hour of blocked io subsystem.

typical values are more in the area of, say, 3 to 7...
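
To put numbers on that rule of thumb, here is a small sketch of the
arithmetic. It only applies the (ko-count * ping-int) estimate from above
and assumes ping-int is in seconds, as in the quoted drbd.conf; the
function name is made up for illustration.

```python
# Rough estimate following the rule of thumb above: ko-count * ping-int.
# Assumption: ping-int is in seconds, as in the quoted drbd.conf.

def blocked_io_seconds(ko_count: int, ping_int_s: int) -> int:
    """Approximate time the io subsystem may stay blocked, in seconds."""
    return ko_count * ping_int_s

# Compare the configured value (180) with the typical range (3 to 7).
for ko in (180, 7, 3):
    secs = blocked_io_seconds(ko, 10)
    print(f"ko-count {ko:>3}: ~{secs} s (~{secs / 60:.1f} min)")
```

With ping-int 10, a ko-count of 180 comes out at 1800 seconds (half an
hour), while values of 3 to 7 stay around a minute or less.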

>         timeout                 60;
>         connect-int             10;
>         ping-int                10;

you want to increase all of these.
e.g. double them until you no longer see that connection problem,
then do a binary search between working and non-working...
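
The doubling-then-bisecting procedure could be sketched roughly as
follows. This is only an illustration: `works` is a hypothetical callback
standing in for "set the value, reconnect, and watch whether the
connection problem goes away", which in practice is a manual observation.
It also assumes that once a value works, every larger value works too.

```python
from typing import Callable

def find_min_working(start: int, works: Callable[[int], bool],
                     limit: int = 1 << 16) -> int:
    """Smallest value >= start for which works(value) is True.

    Assumes start > 0 and that works() is monotonic: if a value works,
    all larger values work as well.
    """
    if works(start):
        return start
    # Phase 1: double until the problem no longer shows up.
    bad, good = start, start * 2
    while not works(good):
        if good >= limit:
            raise ValueError("no working value found below limit")
        bad, good = good, good * 2
    # Phase 2: binary search between the last failing and first working value.
    while good - bad > 1:
        mid = (bad + good) // 2
        if works(mid):
            good = mid
        else:
            bad = mid
    return good

# Hypothetical example: pretend the link only behaves with timeout >= 200.
print(find_min_working(60, lambda value: value >= 200))
```

You probably do not need the exact minimum; stopping the search early and
keeping some headroom above the last failing value is likely good enough.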

>         on-disconnect           reconnect;
>   }
>   syncer {
>         rate                    20M;
>         group                   0;
>   }

what kind of wan link do you have?
is it capable of 20 MByte (~= 200 MBit) per second?
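
For comparison with the 6Mbit link mentioned in the original post, here is
a quick unit conversion. It assumes "rate 20M" means 20 MByte per second,
as read above. The 10-bits-per-byte figure is my assumption about where a
rough ~200 MBit estimate can come from (8 data bits plus some allowance for
protocol overhead); it is not a DRBD constant.

```python
def mbyte_to_mbit(mbyte_per_s: float, bits_per_byte: int = 8) -> float:
    """Convert MByte/s to Mbit/s; bits_per_byte=10 adds a crude overhead allowance."""
    return mbyte_per_s * bits_per_byte

def mbit_to_mbyte(mbit_per_s: float, bits_per_byte: int = 8) -> float:
    """Convert Mbit/s to MByte/s."""
    return mbit_per_s / bits_per_byte

sync_rate_mbyte = 20   # from "rate 20M" in the quoted syncer section
link_mbit = 6          # link speed stated in the original post

print("sync rate, payload only:  ", mbyte_to_mbit(sync_rate_mbyte), "Mbit/s")
print("sync rate, with overhead: ", mbyte_to_mbit(sync_rate_mbyte, 10), "Mbit/s")
print("link capacity:            ", mbit_to_mbyte(link_mbit), "MByte/s")
```

Either way, the configured sync rate is well over an order of magnitude
above what a 6Mbit link can carry.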


> 100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0
> drbd0: Network error during initial handshake. I'll try again.
> drbd0: worker terminated
> drbd0: drbd0_receiver [5100]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [5100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0

...

well, timestamps would be nice,
and a correlated log of both nodes would be better.

and just the first incident, everything else is probably redundant
information, or even just a subsequent error...

but lacking contrary information, I read these logs as:
well, your network connection _is_ flaky.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


