Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
please.
when you start a new thread,
do not reply to an existing one.
please.
On 2007-01-23 12:41:55 -0600, Christopher Harrison wrote:
> I have been experiencing a problem where my drbd disk partitions get
> stuck in a network failure mode. I am noticing a number of these
> errors in my dmesg, and the system becomes unresponsive (processes
> pile up waiting to write to disk; I am also running protocol C). The
> system eventually gets buried under processes waiting to write to
> disk and locks up. Attached is my dmesg from the remote side before
> the primary system locks up completely. I am theorizing that I may
> need to tweak my TCP settings to make the system behave more
> gracefully. Has anyone else had a similar experience? I am running
> over a 6 Mbit bidirectional link with a VPN tunnel built between the sites.
>
> My setup is a Xen host running kernel version
> 2.6.17-1.2174_FC5xenU
>
> drbd version:
> drbd-0.7.21-1
> drbd km version:
> drbd-km-2.6.17_1.2174_FC5xenU-0.7.21-1
>
> I have three drbd partitions, which reside on LVM logical volumes:
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by harrison at dragon, 2006-09-07 09:26:55
> 0: cs:WFConnection st:Primary/Unknown ld:Consistent
> ns:479044 nr:27495304 dw:40351000 dr:4944468 al:408 bm:839 lo:0 pe:0
> ua:0 ap:0
> 1: cs:WFConnection st:Primary/Unknown ld:Consistent
> ns:654288 nr:44147520 dw:46264868 dr:2564263 al:123 bm:629 lo:1 pe:0
> ua:0 ap:1
> 2: cs:WFConnection st:Primary/Unknown ld:Consistent
> ns:16 nr:234632 dw:234652 dr:423 al:1 bm:41 lo:0 pe:0 ua:0 ap:0
>
> I have also included my drbd.conf. I am at a loss as to how to
> prevent or mitigate the lockup situation. We have been running this
> configuration for the past 5 months, but only within the past month
> have I been experiencing this problem. There have been no drbd config
> changes on these systems.
maybe some network component is starting to fail ...
(I do software; I obviously point at the hardware first :->)
> I am also experiencing the same problem with another server running kernel:
> 2.6.18-1.2257.fc5xenU
> drbd version:
> drbd-0.7.22-1
> drbd-km-2.6.18_1.2257.fc5xenU-0.7.22-1
increase the timeouts in drbd.
> net {
> ko-count 180;
that's huge. that means about (180 * ping-int) = 180 * 10 s
-> roughly half an hour of blocked io subsystem.
typical values are more in the area of, say, 3 to 7...
> timeout 60;
> connect-int 10;
> ping-int 10;
you want to increase all of these.
e.g. double them until you no longer see that connection problem,
then do a binary search between working and non-working values...
(see the sketch below, after the net section.)
> on-disconnect reconnect;
> }
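for illustration, a minimal sketch of that net section with the
intervals doubled once and ko-count brought down into the range
mentioned above; these are starting points for the suggested binary
search, not drbd defaults, and both nodes need identical values:

  net {
    ko-count      5;    # was 180; now ~5 * ping-int until giving up, instead of ~30 min
    timeout       120;  # was 60
    connect-int   20;   # was 10
    ping-int      20;   # was 10
    on-disconnect reconnect;
  }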
> syncer {
> rate 20M;
> group 0;
> }
what kind of WAN link do you have?
is it capable of 20 MByte (~ 160 Mbit) per second?
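if it really is the 6 Mbit link mentioned above (~750 KByte/s raw),
a syncer rate well below that leaves room for application traffic;
the figure here is only an illustrative guess, not a measured or
recommended value:

  syncer {
    rate  500K;   # assumption: scaled to the 6 Mbit link, tune to taste
    group 0;
  }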
> drbd0: drbd0_receiver [5100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0
> drbd0: Network error during initial handshake. I'll try again.
> drbd0: worker terminated
> drbd0: drbd0_receiver [5100]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [5100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0
...
well, timestamps would be nice,
and a correlated log of both nodes would be better.
and just the first incident would do; everything else is probably
redundant information, or even just a subsequent error...
but lacking information to the contrary, I read these logs as:
well, your network connection _is_ flaky.
--
: Lars Ellenberg Tel +43-1-8178292-0 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :
__
please use the "List-Reply" function of your email client.