[DRBD-user] Problem with private link and DRBD sync process

Saul, Markus Markus.Saul at danet.de
Thu Nov 23 15:58:50 CET 2006


Hello list,

I have a problem with my setup that is driving me almost mad: some time
after a DRBD sync starts, the connection gets dropped and packet loss
occurs. The two servers are connected by a private cluster link that
carries both heartbeat and DRBD traffic.

Maybe someone has an idea. The network runs fine for days (heartbeat
sends its heartbeat packets, but no real load is put on the private
link), but once DRBD is started on the second node, the link bogs down
after a while, usually after 5-10 seconds, though this varies. It makes
no difference whether heartbeat is running or not.

I don't know if this is a DRBD issue, but since it only shows up during
the DRBD connect/sync, maybe you can give me some sort of hint.


Configuration Details:

OS: Debian stable/testing
DRBD: 0.7.21 (Debian package 0.7.21-3), drbd config attached. The same
behaviour occurred with 0.7.19 before I upgraded to 0.7.21.
Kernel: Linux nagios1 2.6.15-1-686 #2 Mon Mar 6 15:27:08 UTC 2006 i686
GNU/Linux

nagios1 drbd interface: 192.168.1.1
nagios2 drbd interface: 192.168.1.2


The situation at start:

drbd status 

nagios1:/# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:540969 nr:0 dw:2193215 dr:12762931 al:3 bm:1345 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured

nagios2:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:Unconfigured
 1: cs:Unconfigured
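
For anyone who wants to watch these states from a script instead of by
eye, here is a minimal Python sketch that pulls the cs/st/ld fields out
of /proc/drbd text. It assumes only the 0.7-style layout shown above;
the name parse_proc_drbd is made up for illustration and is not part of
DRBD.

```python
import re

# A device line looks like " 0: cs:WFConnection st:Primary/Unknown ld:Consistent".
# The version banner, counter lines and progress bar do not start with "<digits>:".
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+(.*)$")


def parse_proc_drbd(text):
    """Return {minor: {"cs": ..., "st": ..., "ld": ...}} for each device line."""
    devices = {}
    for line in text.splitlines():
        match = DEVICE_LINE.match(line)
        if not match:
            continue  # skip everything that is not a device line
        minor = int(match.group(1))
        # Split "key:value" tokens; unconfigured devices only carry "cs".
        fields = dict(tok.split(":", 1) for tok in match.group(2).split() if ":" in tok)
        devices[minor] = fields
    return devices


sample = """\
version: 0.7.21 (api:79/proto:74)
 0: cs:WFConnection st:Primary/Unknown ld:Consistent
    ns:540969 nr:0 dw:2193215 dr:12762931 al:3 bm:1345 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured
"""
status = parse_proc_drbd(sample)
print(status[0]["cs"], status[1]["cs"])
```

On a live node the text would come from reading /proc/drbd instead of
the sample string.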


The network connection between the nodes looks good at this point:


normal ping:

nagios1:/# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.
[...]
--- 192.168.1.2 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9008ms
rtt min/avg/max/mdev = 0.086/0.178/0.306/0.065 ms


nagios2:~# ping 192.168.1.1
[...]
--- 192.168.1.1 ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.128/0.226/0.304/0.054 ms


large ping:

nagios1:/# ping -s 5000 -f 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 5000(5028) bytes of data.

--- 192.168.1.2 ping statistics ---
24512 packets transmitted, 24512 received, 0% packet loss, time 9867ms
rtt min/avg/max/mdev = 0.207/0.324/0.464/0.028 ms, pipe 2, ipg/ewma
0.402/0.319 ms

nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
.
--- 192.168.1.1 ping statistics ---
10674 packets transmitted, 10673 received, 0% packet loss, time 9711ms
rtt min/avg/max/mdev = 0.211/0.329/0.459/0.029 ms, ipg/ewma 0.909/0.337
ms
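
The summary lines above can also be checked mechanically. A minimal
Python sketch, assuming the iputils-style summary format shown here
(parse_ping_summary is an illustrative name, not an existing tool).
Comparing the raw counters is useful because the percentage is rounded:
the second flood ping reports 0% loss although one packet is missing.

```python
import re

# Matches e.g. "10 packets transmitted, 10 received, 0% packet loss"
# and the variant with an extra "+8 errors," field.
SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received,"
    r"(?: \+(\d+) errors,)? (\d+(?:\.\d+)?)% packet loss"
)


def parse_ping_summary(text):
    """Return the counters from a ping summary line, or None if absent."""
    match = SUMMARY.search(text)
    if match is None:
        return None
    sent, received = int(match.group(1)), int(match.group(2))
    return {
        "sent": sent,
        "received": received,
        "errors": int(match.group(3) or 0),
        "lost": sent - received,  # exact count, not the rounded percentage
        "reported_loss_pct": float(match.group(4)),
    }


line = "10674 packets transmitted, 10673 received, 0% packet loss, time 9711ms"
print(parse_ping_summary(line))
```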


OK, let's start DRBD on nagios2:

version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:53725 dw:53725 dr:0 al:0 bm:13 lo:0 pe:2560 ua:0 ap:0
        [>...................] sync'ed:  0.4% (15243/15296)M
        finish: 0:14:27 speed: 17,900 (17,900) K/sec
 1: cs:Unconfigured

...and boom, the connection goes down the drain. DRBD messages on
nagios2:

drbd0: resync bitmap: bits=9648394 words=301514
drbd0: size = 36 GB (38593576 KB)
drbd0: 14 GB marked out-of-sync by on disk bit-map.
drbd0: Found 4 transactions (26 active extents) in activity log.
drbd0: drbdsetup [4133]: cstate Unconfigured --> StandAlone
drbd0: drbdsetup [4146]: cstate StandAlone --> Unconnected
drbd0: drbd0_receiver [4147]: cstate Unconnected --> WFConnection
drbd0: drbd0_receiver [4147]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(S): 0:00000018:00000008:00000735:00000024:01
drbd0: Peer(P): 1:00000018:00000008:00000736:00000026:10
drbd0: drbd0_receiver [4147]: cstate WFReportParams --> WFBitMapT
drbd0: Secondary/Unknown --> Secondary/Primary
drbd0: drbd0_receiver [4147]: cstate WFBitMapT --> SyncTarget
drbd0: Resync started as SyncTarget (need to sync 15663456 KB [3915864
bits set]).
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [4157]: cstate SyncTarget --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [4147]: cstate NetworkFailure --> BrokenPipe
drbd0: short read receiving data block: read 3408 expected 4096
drbd0: error receiving RSDataReply, l: 4112!
drbd0: worker terminated
drbd0: drbd0_receiver [4147]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [4147]: cstate Unconnected --> WFConnection
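
The sizes in these messages can be cross-checked with a little
arithmetic. A minimal Python sketch, assuming one bitmap bit covers one
4 KB block (my assumption; KB_PER_BIT is an illustrative name, but the
assumption reproduces the logged figures exactly):

```python
KB_PER_BIT = 4  # assumption: one bitmap bit tracks one 4 KB block

total_bits = 9648394  # "resync bitmap: bits=9648394"
dirty_bits = 3915864  # "[3915864 bits set]"

device_kb = total_bits * KB_PER_BIT
resync_kb = dirty_bits * KB_PER_BIT

print(device_kb)               # 38593576, the logged device size in KB
print(resync_kb)               # 15663456, the logged amount to sync in KB
print(device_kb // 1024 ** 2)  # 36, as in "size = 36 GB"
print(resync_kb // 1024 ** 2)  # 14, as in "14 GB marked out-of-sync"
print(resync_kb // 1024)       # 15296, the total in "(15243/15296)M"
```

So the log is internally consistent: the secondary really does have
about 14 GB to fetch, and the link fails after only a few tens of MB.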


nagios2:~# cat /proc/drbd
version: 0.7.21 (api:79/proto:74)
SVN Revision: 2326 build by root at develop-pc, 2006-11-08 16:28:01
 0: cs:WFConnection st:Secondary/Unknown ld:Inconsistent
    ns:0 nr:53725 dw:53725 dr:0 al:0 bm:13 lo:0 pe:0 ua:0 ap:0
 1: cs:Unconfigured


drbd messages for nagios1 meanwhile:

drbd0: drbd0_receiver [31089]: cstate WFConnection --> WFReportParams
drbd0: Handshake successful: DRBD Network Protocol version 74
drbd0: Connection established.
drbd0: I am(P): 1:00000018:00000008:00000736:00000026:10
drbd0: Peer(S): 0:00000018:00000008:00000735:00000024:01
drbd0: drbd0_receiver [31089]: cstate WFReportParams --> WFBitMapS
drbd0: Primary/Unknown --> Primary/Secondary
drbd0: drbd0_receiver [31089]: cstate WFBitMapS --> SyncSource
drbd0: Resync started as SyncSource (need to sync 15663456 KB [3915864
bits set]).
drbd0: [drbd0_worker/7144] sock_sendmsg time expired, ko = 36
drbd0: [drbd0_worker/7144] sock_sendmsg time expired, ko = 35
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [32313]: cstate SyncSource --> NetworkFailure
drbd0: asender terminated
drbd0: drbd0_receiver [31089]: cstate NetworkFailure --> BrokenPipe
drbd0: short read expecting header on sock: r=-512
drbd0: _drbd_send_page: size=4096 len=312 sent=-4
drbd0: drbd_send_block() failed
drbd0: worker terminated
drbd0: ASSERT( mdev->ee_in_use == 0 ) in
/usr/src/modules/drbd/drbd/drbd_receiver.c:1880
drbd0: drbd0_receiver [31089]: cstate BrokenPipe --> Unconnected
drbd0: Connection lost.
drbd0: drbd0_receiver [31089]: cstate Unconnected --> WFConnection
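
To compare both nodes side by side it helps to reduce each log to its
sequence of connection states. A minimal Python sketch, assuming the
"cstate A --> B" message format seen above (cstate_path is an
illustrative name, not a DRBD tool):

```python
import re

# Matches the state change in lines such as
# "drbd0: drbd0_receiver [31089]: cstate WFConnection --> WFReportParams"
TRANSITION = re.compile(r"cstate (\w+) --> (\w+)")


def cstate_path(log_text):
    """Return the ordered list of connection states seen in the log."""
    states = []
    for old, new in TRANSITION.findall(log_text):
        if not states:
            states.append(old)  # starting state of the first transition
        states.append(new)
    return states


log = """\
drbd0: drbd0_receiver [31089]: cstate WFConnection --> WFReportParams
drbd0: drbd0_receiver [31089]: cstate WFReportParams --> WFBitMapS
drbd0: drbd0_receiver [31089]: cstate WFBitMapS --> SyncSource
drbd0: PingAck did not arrive in time.
drbd0: drbd0_asender [32313]: cstate SyncSource --> NetworkFailure
drbd0: drbd0_receiver [31089]: cstate NetworkFailure --> BrokenPipe
drbd0: drbd0_receiver [31089]: cstate BrokenPipe --> Unconnected
drbd0: drbd0_receiver [31089]: cstate Unconnected --> WFConnection
"""
print(" -> ".join(cstate_path(log)))
```

Both nodes follow the same pattern: the sync starts, a PingAck is
missed, and the connection falls back to WFConnection.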



Now the pings again:

normal ping 

nagios1:/# ping 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 56(84) bytes of data.

--- 192.168.1.2 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11011ms


nagios2:~# ping 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 56(84) bytes of data.
From 192.168.1.2 icmp_seq=1 Destination Host Unreachable
From 192.168.1.2 icmp_seq=2 Destination Host Unreachable
From 192.168.1.2 icmp_seq=3 Destination Host Unreachable
From 192.168.1.2 icmp_seq=5 Destination Host Unreachable
From 192.168.1.2 icmp_seq=6 Destination Host Unreachable
From 192.168.1.2 icmp_seq=7 Destination Host Unreachable
From 192.168.1.2 icmp_seq=9 Destination Host Unreachable
From 192.168.1.2 icmp_seq=10 Destination Host Unreachable

--- 192.168.1.1 ping statistics ---
11 packets transmitted, 0 received, +8 errors, 100% packet loss, time 10000ms
, pipe 4


large ping

nagios1:/# ping -s 5000 -f 192.168.1.2
PING 192.168.1.2 (192.168.1.2) 5000(5028) bytes of data.
........................................................................
........................................................................
........................................................................
........................................................................
........................................................................
................
--- 192.168.1.2 ping statistics ---
376 packets transmitted, 0 received, 100% packet loss, time 6064ms
, ipg/ewma 16.171/0.000 ms


nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
........................................................................
........................................................................
........................................................................
........................................................................
........................................................................
..............................................
--- 192.168.1.1 ping statistics ---
406 packets transmitted, 0 received, 100% packet loss, time 6247ms
, ipg/ewma 15.425/0.000 ms



What I have tried so far:

/etc/init.d/networking restart does not help on either node; the
behaviour stays the same.

Rebooting nagios1 does not help either; the behaviour stays the same.

Rebooting nagios2 solves the problem and restores good connectivity (I
disabled DRBD startup on boot, otherwise I would have to reboot again
and again).


nagios2:~# ping -s 5000 -f 192.168.1.1
PING 192.168.1.1 (192.168.1.1) 5000(5028) bytes of data.
..
--- 192.168.1.1 ping statistics ---
2879 packets transmitted, 2877 received, 0% packet loss, time 8501ms
rtt min/avg/max/mdev = 0.210/0.320/0.439/0.033 ms, ipg/ewma 2.953/0.309
ms



Thanks for any hints,

 Markus

-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd.conf
Type: application/octet-stream
Size: 745 bytes
Desc: drbd.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20061123/290a5fb3/attachment.obj>

