[DRBD-user] Sync stuck at 100%

Per Liden per at fukt.bth.se
Thu Nov 25 12:59:28 CET 2004


Hi,

I'm having problems with DRBD getting stuck at around 99-100% during an 
initial/full sync. This happens about 8 out of 10 times. If 
I run "drbdadm down all" on both sides and then "drbdadm up all", both 
nodes connect just fine and both end up in a consistent state. But for 
some reason DRBD does not detect by itself that the sync has actually 
completed. This is what it looks like when the nodes get stuck:

Proc1:~ # cat /proc/drbd 
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:60558616 nr:0 dw:360 dr:60558461 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
        [===================>] sync'ed: 99.6% (248/59387)M
        finish: 4:45:21 speed: 12 (10,488) K/sec

Proc2:~ # cat /proc/drbd 
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:60558616 dw:60558616 dr:0 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
        [===================>] sync'ed:100.0% (0/59139)M
        finish: 0:00:00 speed: 16 (10,480) K/sec
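
The two progress lines above can be compared mechanically. Here is a
minimal sketch in Python, assuming the /proc/drbd progress-line format
printed by DRBD 0.7.4 as shown above. The name parse_sync_line is made
up for illustration and is not part of DRBD:

```python
import re

# Matches the progress line of /proc/drbd as printed above, e.g.
#   [===================>] sync'ed: 99.6% (248/59387)M
# The format is taken from the 0.7.4 output in this post (an assumption
# for any other version).
SYNC_RE = re.compile(r"sync'ed:\s*([\d.]+)%\s*\((\d+)/(\d+)\)M")

def parse_sync_line(text):
    """Return (percent, left_mb, total_mb), or None if no progress line is found."""
    m = SYNC_RE.search(text)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2)), int(m.group(3))

# Progress lines copied from the two nodes above.
proc1 = "        [===================>] sync'ed: 99.6% (248/59387)M"
proc2 = "        [===================>] sync'ed:100.0% (0/59139)M"

src = parse_sync_line(proc1)
dst = parse_sync_line(proc2)
print("source:", src)
print("target:", dst)
print("difference in total (MB):", src[2] - dst[2])
```

With the values above, the two totals differ by 248 MB, which is the
same amount the source still reports as left to sync.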


/var/log/messages on Proc1:
...
Nov 25 11:07:02 Proc1 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc1 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc1 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc1 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc1 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc1 kernel: drbd0: 248 MB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc1 kernel: drbd0: Found 4 transactions (64 active extents) in activity log.
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1094]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1096]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc1 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc1 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc1 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc1 kernel: drbd0: I am(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: Peer(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFReportParams --> WFBitMapS
Nov 25 11:07:03 Proc1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFBitMapS --> SyncSource
Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).
Nov 25 11:07:18 Proc1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
Nov 25 11:15:38 Proc1 kernel: drbd0: [drbd0_worker/1095] sock_sendmsg time expired, ko = 4294967295

/var/log/messages on Proc2:
...
Nov 25 11:07:02 Proc2 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc2 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc2 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc2 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc2 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc2 kernel: drbd0: 80 KB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc2 kernel: drbd0: Found 4 transactions (52 active extents) in activity log.
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1105]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1107]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc2 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc2 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc2 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc2 kernel: drbd0: I am(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: Peer(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFReportParams --> WFBitMapT
Nov 25 11:07:03 Proc2 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:04 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFBitMapT --> SyncTarget
Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).
Nov 25 11:07:18 Proc2 kernel: drbd0: Secondary/Secondary --> Secondary/Primary


Note that the nodes seem to have different ideas about 
how much data needs to be synchronized, i.e.:
  Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).
vs.
  Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).
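
The size of that disagreement can be worked out from the two log lines
alone. A small Python sketch follows. The 4 KB-per-bit granularity is
inferred from the numbers in the logs themselves (KB divided by bits),
not taken from DRBD documentation:

```python
# Figures copied from the "Resync started" log lines above.
source_kb, source_bits = 60812372, 15203093   # Proc1, SyncSource
target_kb, target_bits = 60558500, 15139625   # Proc2, SyncTarget

# Inferred bitmap granularity: both lines work out to exactly 4 KB per bit.
KB_PER_BIT = 4
assert source_bits * KB_PER_BIT == source_kb
assert target_bits * KB_PER_BIT == target_kb

diff_bits = source_bits - target_bits
diff_kb = source_kb - target_kb
diff_mb = diff_kb / 1024

print(diff_bits, "bits")
print(diff_kb, "KB =", round(diff_mb, 1), "MB")
```

The difference comes out at 253872 KB, or roughly 248 MB. That is close
to the "248 MB marked out-of-sync by on disk bit-map" that Proc1 logged
at startup and to the 248 MB it still shows as remaining in /proc/drbd.
Whether the two are actually related is only a guess.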

The nodes are connected with a gigabit crossover cable. The network itself 
works fine even after the sync halts. The sync rate is set to 30M, but I get 
the same result with 10M. Also, in my configuration DRBD runs on top of an 
LVM device.

Any ideas?

/Per


