Hello list,

in the setup I run here I have a problem that is driving me almost mad: when doing a drbd sync, after some time the connection gets dropped and packet loss occurs.

The setup is two servers connected with a cluster link for heartbeat and drbd. Maybe someone has an idea.

The network "runs" fine for days (heartbeat sends its heartbeat packets, but no real load is put on the private link), but if you start up drbd on the second node, the link bogs down after a while (usually after 5-10 seconds, but this varies). It also doesn't matter whether heartbeat is running or not.

I don't know if this is a drbd issue, but maybe you can give me some sort of hint, since it only shows up when doing the drbd connect/sync.

Configuration details:

OS: Debian stable/testing
DRBD version: 0.7.21 (Debian 0.7.21-3 is installed), drbd config attached. The same behaviour existed with 0.7.19, but I upgraded to 0.7.21.
Kernel: Linux nagios1 2.6.15-1-686 #2 Mon Mar 6 15:27:08 UTC 2006 i686 GNU/Linux

nagios1 drbd interface: 192.168.1.1
nagios2 drbd interface: 192.168.1.2

The situation at start:

drbd status

nagios1:/# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:540969 nr:0 dw:2193215 dr:12762931 al:3 bm:1345 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured

nagios2:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:Unconfigured
 1: cs:Unconfigured

The network connection between the nodes still seems to be good:

normal ping:

nagios1:/# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
[...]
--- 192.168.1.2 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9008ms
rtt min/avg/max/mdev = 0.086/0.178/0.306/0.065 ms

nagios2:~# ping 192.168.1.1
[...]
--- 192.168.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.128/0.226/0.304/0.054 ms

large ping:

nagios1:/# ping -s 5000 -f 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 5000(5028) bytes of data.
--- 192.168.1.2 ping statistics ---
24512 packets transmitted, 24512 received, 0% packet loss, time 9867ms
rtt min/avg/max/mdev = 0.207/0.324/0.464/0.028 ms, pipe 2, ipg/ewma 0.402/0.319 ms

nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
.
--- 192.168.1.1 ping statistics ---
10674 packets transmitted, 10673 received, 0% packet loss, time 9711ms
rtt min/avg/max/mdev = 0.211/0.329/0.459/0.029 ms, ipg/ewma 0.909/0.337 ms

OK... let's start drbd on nagios2:

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:53725 dw:53725 dr:0 al:0 bm:13 lo:0 pe:2560 ua:0 ap:0
        [>...................] sync'ed: 0.4% (15243/15296)M
        finish: 0:14:27 speed: 17,900 (17,900) K/sec
 1: cs:Unconfigured

...boom there the connection goes down the drain:

drbd0: resync bitmap: bits=9648394 words=301514
drbd0: size = 36 GB (38593576 KB)
drbd0: 14 GB marked out-of-sync by on disk bit-map.
drbd0: Found 4 transactions (26 active extents) in activity log.
drbd0: drbdsetup [4133]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [4146]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [4147]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [4147]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 0:00000018:00000008:00000735:00000024:01
drbd0: Peer(P): 1:00000018:00000008:00000736:00000026:10
drbd0: drbd0_receiver [4147]: cstate WFReportParams --> WFBitMapT
drbd0: Secondary/Unknown --> Secondary/Primary
drbd0: drbd0_receiver [4147]: cstate WFBitMapT --> SyncTarget
drbd0: Resync started as SyncTarget (need to sync 15663456 KB [3915864 bits set]).
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [4157]: cstate SyncTarget --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [4147]: cstate NetworkFailure --> BrokenPipe
drbd0: short read receiving data block: read 3408 expected 4096
drbd0: error receiving RSDataReply, l: 4112!
drbd0: worker terminated
drbd0: drbd0_receiver [4147]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [4147]: cstate Unconnected --> WFConnection

nagios2:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:WFConnection st:Secondary/Unknown ld:Inconsistent
    ns:0 nr:53725 dw:53725 dr:0 al:0 bm:13 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured

drbd messages for nagios1 meanwhile:

drbd0: drbd0_receiver [31089]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(P): 1:00000018:00000008:00000736:00000026:10
drbd0: Peer(S): 0:00000018:00000008:00000735:00000024:01
drbd0: drbd0_receiver [31089]: cstate WFReportParams --> WFBitMapS
drbd0: Primary/Unknown --> Primary/Secondary
drbd0: drbd0_receiver [31089]: cstate WFBitMapS --> SyncSource
drbd0: Resync started as SyncSource (need to sync 15663456 KB [3915864 bits set]).
drbd0: [drbd0_worker/7144] sock_sendmsg time expired, ko = 36
drbd0: [drbd0_worker/7144] sock_sendmsg time expired, ko = 35
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [32313]: cstate SyncSource --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [31089]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: _drbd_send_page: size=4096 len=312 sent=-4
drbd0: drbd_send_block() failed
drbd0: worker terminated
drbd0: ASSERT( mdev->ee_in_use == 0 ) in /usr/src/modules/drbd/drbd/drbd_receiver.c:1880
drbd0: drbd0_receiver [31089]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [31089]: cstate Unconnected --> WFConnection

now for the pings again:

normal ping

nagios1:/# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
--- 192.168.1.2 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11011ms

nagios2:~# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
From 192.168.1.2 icmp_seq=2 Destination Host Unreachable
From 192.168.1.2 icmp_seq=3 Destination Host Unreachable
From 192.168.1.2 icmp_seq=5 Destination Host Unreachable
From 192.168.1.2 icmp_seq=6 Destination Host Unreachable
From 192.168.1.2 icmp_seq=7 Destination Host Unreachable
From 192.168.1.2 icmp_seq=9 Destination Host Unreachable
From 192.168.1.2 icmp_seq=10 Destination Host Unreachable
--- 192.168.1.1 ping statistics ---
11 packets transmitted, 0 received, +8 errors, 100% packet loss, time 10000ms
, pipe 4

large ping

nagios1:/# ping -s 5000 -f 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 5000(5028) bytes of data.
........................................................................
........................................................................
........................................................................
........................................................................
........................................................................
................
--- 192.168.1.2 ping statistics ---
376 packets transmitted, 0 received, 100% packet loss, time 6064ms
, ipg/ewma 16.171/0.000 ms

nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
........................................................................
........................................................................
........................................................................
........................................................................
........................................................................
..............................................
--- 192.168.1.1 ping statistics ---
406 packets transmitted, 0 received, 100% packet loss, time 6247ms
, ipg/ewma 15.425/0.000 ms

Now for what I tried in order to solve the problem:

/etc/init.d/networking restart doesn't help on either node, still the same behaviour

rebooting nagios1 doesn't help either, still the same behaviour

rebooting nagios2 solves the problem, good connectivity again (I disabled DRBD startup on boot, otherwise I'd have to reboot again and again and so forth)

nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
..
--- 192.168.1.1 ping statistics ---
2879 packets transmitted, 2877 received, 0% packet loss, time 8501ms
rtt min/avg/max/mdev = 0.210/0.320/0.439/0.033 ms, ipg/ewma 2.953/0.309 ms

Thanks for any hints,
Markus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd.conf
Type: application/octet-stream
Size: 745 bytes
Desc: drbd.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20061123/290a5fb3/attachment.obj>
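
For anyone trying to reproduce this, the moment the link drops could be caught by polling /proc/drbd and comparing the connection state between two reads. What follows is only a minimal Python sketch, assuming the DRBD 0.7 status layout shown in the outputs above (device lines of the form "0: cs:... st:... ld:..."). The names parse_drbd_status and connection_lost are made up for this illustration and are not part of DRBD.

```python
import re

# Matches one device line of DRBD 0.7-style /proc/drbd output, e.g.
# " 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent".
# The st: and ld: fields are optional because an unconfigured device
# only prints "1: cs:Unconfigured".
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)(?:\s+st:(\S+))?(?:\s+ld:(\S+))?")


def parse_drbd_status(text):
    """Return {minor: {"cs": ..., "st": ..., "ld": ...}} for each device line.

    Hypothetical helper: lines that do not start with "<number>:" (version
    banner, counters, progress bar) are ignored.
    """
    devices = {}
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            minor, cs, st, ld = match.groups()
            devices[int(minor)] = {"cs": cs, "st": st, "ld": ld}
    return devices


def connection_lost(before, after, minor=0):
    """True if the device was resyncing and is now waiting for its peer again."""
    was_syncing = before[minor]["cs"] in ("SyncTarget", "SyncSource")
    return was_syncing and after[minor]["cs"] == "WFConnection"
```

Feeding it the contents of /proc/drbd once a second or so (for example via open("/proc/drbd").read()) and logging a timestamp when connection_lost returns True would mark the SyncTarget to WFConnection transition seen on nagios2, which could then be lined up against the kernel messages and the ping results.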