[DRBD-user] "syncer" crash when doing full resync

Fri Jun 25 14:20:59 CEST 2004

Hi there!

I've been reading through the ML for a while now and can't find anything
matching my problem, so I hope someone can help me this way.

I'm using two "standard" hosts running kernel 2.4.26 with the vserver
("ctx") patch 1.27.
One has an Intel CPU, the other an AMD, but apart from that their
configuration is very similar:
.) 1 NIC to the outside
.) 1 NIC internal (10.99.99.1 + .2) with a crossover cable
.) crossover serial cable connected to ttyS0 on both
.) 2 HDDs (hda+hdc) with the following partition table:

Disk /dev/hda: 120.0 GB, 120060444672 bytes
255 heads, 63 sectors/track, 14596 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/hda1               1         973     7815591   fd  Linux raid autodetect
/dev/hda2             974        1459     3903795   fd  Linux raid autodetect
/dev/hda3            1460        1708     2000092+  82  Linux swap
/dev/hda4            1709       14500   102751740   fd  Linux raid autodetect

The md arrays on top of these partitions are mounted as follows:
/dev/md1 on / type ext3 (rw)
/dev/md2 on /var type ext3 (rw)
/dev/md4 on /vservers type ext3 (rw) (yes, I called it md4.)

Both systems are running Debian sarge. I now wanted to make md4 "a drbd
device" and used the following drbd.conf:
----------
global {
    minor_count=1
    # disable_io_hints
}

resource vs {
  protocol = C
  fsckcmd  = /bin/true
  # inittimeout=60
  # skip-wait
  # load-only
  # incon-degr-cmd=halt -f

  disk {
    do-panic
    disk-size = 51375808k
  }

  net {
    sndbuf-size = 512k
    skip-sync
    # sync-group = 1

    sync-min   = 10M   # syncer tries hard to not drop below this rate
    sync-max   = 25M # if you don't care about network saturation
    tl-size     = 5000  # transfer log size, ensures strict write ordering
    timeout     = 60    # unit: 0.1 seconds
    connect-int = 10    # unit: seconds
    ping-int    = 10    # unit: seconds
    ko-count    = 10    # if some block send times out this many times,
			# the peer is considered dead, even if it still
			# answers ping requests
  }

  on cthon {
    device  = /dev/nb0
    disk    = /dev/md4
    address = 10.99.99.1
    port    = 7788
  }

  on nightmare {
    device  = /dev/nb0
    disk    = /dev/md4
    address = 10.99.99.2
    port    = 7788
  }
}
----------
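
For orientation, the disk and net settings above end up as drbdsetup
command lines (the syslog excerpt further down shows the real ones). The
following is only a minimal sketch of that mapping, assuming each option
becomes either --name=value or a bare --name switch. The names disk_opts,
net_opts, flags, disk_cmd and net_cmd are made up for this illustration and
are not taken from /etc/init.d/drbd.

```python
# Sketch only: reproduces the drbdsetup calls seen in syslog, it is NOT the
# actual logic of the drbd init script. All names here are hypothetical.
disk_opts = {"do-panic": None, "disk-size": "51375808k"}
net_opts = {
    "sndbuf-size": "512k", "skip-sync": None,
    "sync-min": "10M", "sync-max": "25M", "tl-size": "5000",
    "timeout": "60", "connect-int": "10", "ping-int": "10",
    "ko-count": "10",
}


def flags(opts):
    """Render {name: value} as --name=value, or --name for bare switches."""
    return [f"--{k}" if v is None else f"--{k}={v}" for k, v in opts.items()]


def disk_cmd(device, disk, opts):
    """Build the 'drbdsetup <device> disk ...' line."""
    return " ".join(["drbdsetup", device, "disk", disk] + flags(opts))


def net_cmd(device, local, peer, protocol, opts):
    """Build the 'drbdsetup <device> net ...' line."""
    return " ".join(
        ["drbdsetup", device, "net", local, peer, protocol] + flags(opts)
    )


print(disk_cmd("/dev/nb0", "/dev/md4", disk_opts))
print(net_cmd("/dev/nb0", "10.99.99.1:7788", "10.99.99.2:7788", "C", net_opts))
```

Comparing the printed lines with the ones in syslog is a quick way to see
that every option from drbd.conf actually reached the kernel module.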

I first used "drbd-0.6.12.tar.gz" and then tried a checkout of the
current stable CVS, but it made no difference. My problem is that when I run
/etc/init.d/drbd start on both boxes, everything seems to work as it
should at first:

primary (cthon):
version: 0.6.12 (api:64/proto:62)
0: cs:Connected st:Primary/Secondary ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
        NEEDS_SYNC

secondary (nightmare):
version: 0.6.12 (api:64/proto:62)
0: cs:WFConnection st:Secondary/Unknown ns:0 nr:0 dw:0 dr:0 pe:0 ua:0
        NEEDS_SYNC      INCONSISTENT

Then I start the resync process with:

cthon:~# drbdsetup /dev/nb0 replicate

It does work, reporting:
0: cs:SyncingAll st:Primary/Secondary ns:116020 nr:0 dw:0 dr:116224 pe:20469 ua:0
        [>...................] sync'ed:  0.3% (50057/50171)M
        finish: 4:02:59h speed: 3,529 (3,529) K/sec
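
As a rough cross-check of these figures, the sketch below recomputes the
device size in M and the remaining sync time from the numbers shown. It
assumes that "K" and "M" are binary units (1 M = 1024 K) and that
sync-min = 10M means 10 M per second. Both are assumptions, and the helper
names are made up for the example.

```python
KIB_PER_MIB = 1024            # assumption: binary units in the status output

disk_size_kib = 51375808      # disk-size from drbd.conf
remaining_mib = 50057         # "(50057/50171)M" in the status line
rate_kib_s = 3529             # "speed: 3,529 (3,529) K/sec"
sync_min_kib_s = 10 * 1024    # sync-min = 10M, read as 10 M/sec (assumption)


def total_mib(size_kib):
    """Device size in whole M, rounded down."""
    return size_kib // KIB_PER_MIB


def eta_seconds(remaining, rate):
    """Seconds left if the current rate stays constant."""
    return remaining * KIB_PER_MIB / rate


def hms(seconds):
    """Format a number of seconds as h:mm:ss."""
    s = int(seconds)
    return f"{s // 3600}:{s % 3600 // 60:02d}:{s % 60:02d}"


print(total_mib(disk_size_kib))                      # 50171, as in the status line
print(hms(eta_seconds(remaining_mib, rate_kib_s)))   # roughly four hours
print(round(rate_kib_s / sync_min_kib_s, 2))         # fraction of sync-min reached
```

Under these assumptions the total matches the 50171 M shown above, the
estimate lands close to the reported four hours, and the observed rate is
only about a third of the configured sync-min.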

BUT in the meantime the syncer has crashed on the primary. According to syslog:

Jun 25 14:10:33 cthon drbd: ===> drbd start <===
Jun 25 14:10:33 cthon drbd: modprobe -s drbd minor_count=1
Jun 25 14:10:33 cthon kernel: drbd: initialised. Version: 0.6.12
(api:64/proto:62)
Jun 25 14:10:33 cthon drbd: drbdsetup /dev/nb0 disk /dev/md4 --do-panic
--disk-size=51375808k
Jun 25 14:10:33 cthon drbd: drbdsetup /dev/nb0 net 10.99.99.1:7788
10.99.99.2:7788 C --sndbuf-size=512k --skip-sync --sync-min=10M
--sync-max=25M --tl-size=5000 --timeout=60 --connect-int=10 --ping-int=10
--ko-count=10
Jun 25 14:10:33 cthon drbd: drbdsetup /dev/nb0 wait_connect -t 0
Jun 25 14:10:33 cthon kernel: drbd0: Connection established. size=51375808
KB / blksize=4096 B
Jun 25 14:10:33 cthon kernel: klogd 1.4.1, ---------- state change ----------
Jun 25 14:10:33 cthon kernel: Loaded 60 symbols from 1 module.
Jun 25 14:11:07 cthon kernel: drbd0: FULL Synchronisation started blks=64
Jun 25 14:11:10 cthon kernel: general protection fault: 0000
Jun 25 14:11:10 cthon kernel: CPU:    0
Jun 25 14:11:10 cthon kernel: EIP:    0010:[<de92d773>]    Not tainted
Jun 25 14:11:10 cthon kernel: EFLAGS: 00010286
Jun 25 14:11:10 cthon kernel: eax: ffffffff   ebx: 00000000   ecx:
00000001   edx: ffffffff
Jun 25 14:11:10 cthon kernel: esi: ffffffff   edi: dbd33f9c   ebp:
0000f200   esp: dbd33f44
Jun 25 14:11:10 cthon kernel: ds: 0018   es: 0018   ss: 0018
Jun 25 14:11:10 cthon kernel: Process drbdd_0 (pid: 638, stackpage=dbd33000)
Jun 25 14:11:10 cthon kernel: Stack: dbd33f14 00000001 00000000 00000000
00004100 00000286 00000000 dcc88000
Jun 25 14:11:10 cthon kernel:        dbd33f9c 00000000 de92aff9 ffffffff
0000f200 00000001 00000000 00000000
Jun 25 14:11:10 cthon kernel:        00000000 0000214a 00000000 dcc88000
dbf5c5a0 00000002 67027483 00000300
Jun 25 14:11:10 cthon kernel: Call Trace:    [<de92aff9>] [<de92b45a>]
[<de92f4db>] [<de926a8d>] [<c010738e>]
Jun 25 14:11:10 cthon kernel:   [<de926a60>]
Jun 25 14:11:10 cthon kernel:
Jun 25 14:11:10 cthon kernel: Code: 8b 06 0f b6 58 14 a1 e4 fb 92 de 69 db
dc 02 00 00 01 c3 31
Jun 25 14:11:21 cthon kernel:  <3>drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=9
Jun 25 14:11:24 cthon kernel: drbd0: ping ack did not arrive, trying to
reconnect
Jun 25 14:11:27 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=8
Jun 25 14:11:33 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=7
Jun 25 14:11:39 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=6
Jun 25 14:11:45 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=5
Jun 25 14:11:51 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=4
Jun 25 14:11:57 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=3
Jun 25 14:12:03 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=2
Jun 25 14:12:09 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
timeout count down: ko=1
Jun 25 14:12:14 cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg
returned -104
Jun 25 14:12:14 cthon kernel: drbd0: Syncer send failed.
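
The spacing of the "count down" messages can be compared against the net
settings (timeout = 60 in units of 0.1 seconds, ko-count = 10). The sketch
below does that for the lines above, with the mail line wraps joined back
together. LOG_LINES, PATTERN and countdown are names invented for this
example, and the errno lookup at the end only gives a meaningful answer on
Linux.

```python
import errno
import re

TIMEOUT_TENTHS = 60  # net { timeout = 60 }, unit: 0.1 seconds

# Countdown lines from the syslog excerpt, line wraps joined.
PREFIX = "cthon kernel: drbd0: [drbd_syncer_0/685] sock_sendmsg timeout count down:"
LOG_LINES = [
    "Jun 25 14:11:21 cthon kernel:  <3>drbd0: [drbd_syncer_0/685] "
    "sock_sendmsg timeout count down: ko=9",
    f"Jun 25 14:11:27 {PREFIX} ko=8",
    f"Jun 25 14:11:33 {PREFIX} ko=7",
    f"Jun 25 14:11:39 {PREFIX} ko=6",
    f"Jun 25 14:11:45 {PREFIX} ko=5",
    f"Jun 25 14:11:51 {PREFIX} ko=4",
    f"Jun 25 14:11:57 {PREFIX} ko=3",
    f"Jun 25 14:12:03 {PREFIX} ko=2",
    f"Jun 25 14:12:09 {PREFIX} ko=1",
]

PATTERN = re.compile(r"(\d\d):(\d\d):(\d\d) .*count down: ko=(\d+)")


def countdown(lines):
    """Return [(seconds_since_midnight, ko), ...] for each countdown line."""
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            h, mi, s, ko = map(int, m.groups())
            events.append((h * 3600 + mi * 60 + s, ko))
    return events


events = countdown(LOG_LINES)
gaps = [later[0] - earlier[0] for earlier, later in zip(events, events[1:])]
print(gaps)                      # seconds between consecutive countdown messages
print(TIMEOUT_TENTHS / 10)       # configured send timeout in seconds
# On Linux, errno 104 is ECONNRESET; other platforms number errno differently.
print(errno.errorcode.get(104))
```

Every gap comes out at 6 seconds, which is the configured timeout, so the
countdown appears to be the ko-count mechanism working as configured after
the fault at 14:11:10, rather than a separate problem.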

This definitely doesn't look right to me =)
I noticed people were talking about the blksize, but how can I change it?
Am I overlooking something else? Any help would be appreciated.

tia!

-- 
Markus Rambossek
MXR66-RIPE

mxr at mos.at
carrier66.net NetWork
DataCenter Wien
Shuttleworthstrasse 4-8
1210 Wien
Austria

Mobil: +43 650 4126691