Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
please. when you start a new thread, do not reply to an existing one. please.

/ 2007-01-23 12:41:55 -0600
\ Christopher Harrison:
> I have been experiencing a problem where my drbd disk partitions get
> stuck in a network failure mode.  I am noticing a number of these
> errors in my dmesg, and the system becomes unresponsive (due to
> processes waiting to write to disk; I am also running protocol C).
> The system eventually gets buried by processes waiting to write to
> disk and will eventually lock up.  Attached is my dmesg on the remote
> side before the primary system locks up completely.  I am theorizing
> that I may need to tweak my tcp settings to make the system behave
> more gracefully.  Has anyone else had a similar experience?  I am
> running over a 6 Mbit bidirectional link with a vpn tunnel built
> between sites.
>
> My setup is a xen host running kernel version
> 2.6.17-1.2174_FC5xenU
>
> drbd version:
> drbd-0.7.21-1
> drbd km version:
> drbd-km-2.6.17_1.2174_FC5xenU-0.7.21-1
>
> I have three drbd partitions which reside on lvm's:
> version: 0.7.21 (api:79/proto:74)
> SVN Revision: 2326 build by harrison at dragon, 2006-09-07 09:26:55
>  0: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:479044 nr:27495304 dw:40351000 dr:4944468 al:408 bm:839 lo:0 pe:0
>     ua:0 ap:0
>  1: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:654288 nr:44147520 dw:46264868 dr:2564263 al:123 bm:629 lo:1 pe:0
>     ua:0 ap:1
>  2: cs:WFConnection st:Primary/Unknown ld:Consistent
>     ns:16 nr:234632 dw:234652 dr:423 al:1 bm:41 lo:0 pe:0 ua:0 ap:0
>
> I have also included my drbd.conf.  I am at a loss to prevent or
> mitigate the lockup situation.  We have been running this
> configuration for the past 5 months, but only within the past month
> have I been experiencing this problem.  There have been no drbd
> config changes to these systems.

maybe some network components started failing ...
(I do software; I obviously first point at the hardware :->)

> I am also experiencing the same problem with another server running
> kernel:
> 2.6.18-1.2257.fc5xenU
> drbd version:
> drbd-0.7.22-1
> drbd-km-2.6.18_1.2257.fc5xenU-0.7.22-1

increase the timeouts in drbd.

> net {
>     ko-count 180;

that's huge.  that means about (180 * ping-int) -> half an hour of
blocked io subsystem.  typical values are more in the area of, say, 3 to 7 ...

>     timeout 60;
>     connect-int 10;
>     ping-int 10;

you want to increase all of these.
e.g. double them until you no longer see that connection problem,
then do a binary search between working and non-working ...

>     on-disconnect reconnect;
> }
> syncer {
>     rate 20M;
>     group 0;
> }

what kind of wan link do you have?
is it capable of 20 MByte (~= 200 MBit) per second?
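To make that concrete, here is a rough sketch of how the net and syncer
sections might look after a first round of tuning.  The specific numbers
(ko-count 7, timeout/connect-int/ping-int doubled from 60/10/10, rate 500K)
are assumptions, meant only as a starting point for the double-then-
binary-search approach described above; the syncer rate is sized against
the 6 Mbit link mentioned in the report rather than the configured 20M.

    net {
        # with ko-count 180 and ping-int 10, the arithmetic above works
        # out to roughly 180 * 10s = 1800s (half an hour) of blocked I/O;
        # ko-count 7 with ping-int 20 gives about 140s instead.
        ko-count      7;
        # doubled from 60/10/10 as a first step; timeout is given in
        # tenths of a second, connect-int and ping-int in seconds.
        timeout       120;   # i.e. 12 seconds
        connect-int   20;
        ping-int      20;
        on-disconnect reconnect;
    }
    syncer {
        # a 6 Mbit/s link carries well under 1 MByte/s, so "rate 20M"
        # (~200 Mbit/s) can never be met; something like 500K leaves
        # headroom for the ongoing replication traffic over the VPN.
        rate   500K;
        group  0;
    }

Whether 120/20/20 is already enough, or still too low for this particular
VPN, is exactly what the doubling and binary search are meant to find out.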
> 100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0
> drbd0: Network error during initial handshake. I'll try again.
> drbd0: worker terminated
> drbd0: drbd0_receiver [5100]: cstate BrokenPipe --> Unconnected
> drbd0: Connection lost.
> drbd0: drbd0_receiver [5100]: cstate Unconnected --> WFConnection
> drbd0: drbd0_receiver [5100]: cstate WFConnection --> WFReportParams
> drbd0: sock was shut down by peer
> drbd0: drbd0_receiver [5100]: cstate WFReportParams --> BrokenPipe
> drbd0: short read expecting header on sock: r=0

... well, timestamps would be nice, and a correlated log of both nodes
would be better.  and just the first incident; everything else is
probably redundant information, or even just a subsequent error ...

but lacking contrary information, I read these logs as:
well, your network connection _is_ flaky.

--
: Lars Ellenberg                          Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH    Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe   http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
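On the point that timestamps and a correlated log of both nodes would be
better than a raw dmesg dump: syslog prefixes each kernel message with a
timestamp and a hostname, so the drbd lines from both machines can be
merged into one chronological view.  A minimal sketch, assuming the usual
/var/log/messages location, a reachable peer called peer-node, and clocks
kept in sync (e.g. via ntp):

    # collect the drbd kernel messages, with syslog timestamps, from both nodes
    grep drbd /var/log/messages                  > /tmp/drbd-local.log
    ssh peer-node 'grep drbd /var/log/messages'  > /tmp/drbd-peer.log

    # merge chronologically: month name, day of month, then HH:MM:SS
    sort -k1,1M -k2,2n -k3,3 /tmp/drbd-local.log /tmp/drbd-peer.log \
        > /tmp/drbd-correlated.log

Each merged line still carries the hostname of the node that logged it,
which makes it easy to see whether the WFConnection/BrokenPipe cycles on
the two sides line up with each other and with any VPN or link events.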