Hi,
I'm having problems with DRBD getting stuck at around 99-100% during an
initial/full sync. This happens about 8 out of 10 times. If I run
"drbdadm down all" on both sides and then "drbdadm up all", both nodes
connect just fine and both end up in a consistent state. But for some
reason DRBD does not detect on its own that the sync has actually
completed. This is what it looks like when the nodes get stuck:
Proc1:~ # cat /proc/drbd
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
0: cs:SyncSource st:Primary/Secondary ld:Consistent
ns:60558616 nr:0 dw:360 dr:60558461 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
[===================>] sync'ed: 99.6% (248/59387)M
finish: 4:45:21 speed: 12 (10,488) K/sec
Proc2:~ # cat /proc/drbd
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
ns:0 nr:60558616 dw:60558616 dr:0 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
[===================>] sync'ed:100.0% (0/59139)M
finish: 0:00:00 speed: 16 (10,480) K/sec
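To compare the two sides without reading the output by eye, the progress line can be pulled apart with a small script. This is only a sketch: the names `PROGRESS_RE` and `parse_progress` are made up for illustration, and the pattern assumes the 0.7.x progress-line layout exactly as pasted above.

```python
import re

# Pattern for the "sync'ed" progress line as it appears in the output above.
# Assumption: percent, then (remaining/total) in megabytes.
PROGRESS_RE = re.compile(r"sync'ed:\s*([\d.]+)%\s*\((\d+)/(\d+)\)M")


def parse_progress(line):
    """Return (percent, remaining_mb, total_mb), or None if the line has no progress field."""
    m = PROGRESS_RE.search(line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))


# The two progress lines copied from the /proc/drbd output above.
proc1 = parse_progress("[===================>] sync'ed: 99.6% (248/59387)M")
proc2 = parse_progress("[===================>] sync'ed:100.0% (0/59139)M")
print("Proc1:", proc1)
print("Proc2:", proc2)
```

Running this on two snapshots taken some time apart and comparing the remaining figure would be one way to tell a stalled sync from a slow one.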
/var/log/messages on Proc1:
...
Nov 25 11:07:02 Proc1 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc1 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc1 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc1 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc1 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc1 kernel: drbd0: 248 MB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc1 kernel: drbd0: Found 4 transactions (64 active extents) in activity log.
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1094]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1096]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc1 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc1 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc1 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc1 kernel: drbd0: I am(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: Peer(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFReportParams --> WFBitMapS
Nov 25 11:07:03 Proc1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFBitMapS --> SyncSource
Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).
Nov 25 11:07:18 Proc1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
Nov 25 11:15:38 Proc1 kernel: drbd0: [drbd0_worker/1095] sock_sendmsg time expired, ko = 4294967295
/var/log/messages on Proc2:
...
Nov 25 11:07:02 Proc2 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc2 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc2 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc2 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc2 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc2 kernel: drbd0: 80 KB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc2 kernel: drbd0: Found 4 transactions (52 active extents) in activity log.
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1105]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1107]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc2 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc2 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc2 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc2 kernel: drbd0: I am(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: Peer(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFReportParams --> WFBitMapT
Nov 25 11:07:03 Proc2 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:04 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFBitMapT --> SyncTarget
Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).
Nov 25 11:07:18 Proc2 kernel: drbd0: Secondary/Secondary --> Secondary/Primary
Interestingly, the two nodes seem to disagree about how much data needs
to be synchronized:
Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).
vs.
Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).
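A quick check of those two figures, as a sketch only. The 4 KB-per-bit granularity is an assumption inferred from the logged KB and bit counts, and the variable names are illustrative.

```python
# Assumption inferred from the log lines: each bitmap bit covers 4 KB.
KB_PER_BIT = 4

source_kb, source_bits = 60812372, 15203093  # Proc1, SyncSource
target_kb, target_bits = 60558500, 15139625  # Proc2, SyncTarget

# Both log lines are internally consistent with 4 KB per bit.
assert source_bits * KB_PER_BIT == source_kb
assert target_bits * KB_PER_BIT == target_kb

# Difference between what the source and the target think must be synced.
diff_kb = source_kb - target_kb
diff_bits = source_bits - target_bits
diff_mb = diff_kb / 1024

print(f"difference: {diff_bits} bits = {diff_kb} KB = {diff_mb:.1f} MB")
```

The difference comes to roughly 248 MB, which is the same size as the "248 MB marked out-of-sync by on disk bit-map" message on Proc1 and the 248 M that Proc1 still reports as remaining. That may or may not be a coincidence.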
The nodes are connected with a gigabit crossover cable. The network itself
works fine even after the sync halts. The sync rate is set to 30M, but I get
the same result with 10M. Also, in my configuration DRBD runs on top of an
LVM device.
Any ideas?
/Per