Hi,

I'm having problems with DRBD getting stuck at around 99-100% during an initial/full sync. This happens roughly 8 times out of 10. If I run "drbdadm down all" on both sides and then "drbdadm up all", both nodes connect just fine and both end up in a consistent state. But for some reason DRBD does not detect on its own that the sync has actually completed.

This is what it looks like when they get stuck:

Proc1:~ # cat /proc/drbd
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:60558616 nr:0 dw:360 dr:60558461 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
        [===================>] sync'ed: 99.6% (248/59387)M
        finish: 4:45:21 speed: 12 (10,488) K/sec

Proc2:~ # cat /proc/drbd
version: 0.7.4 (api:76/proto:74)
SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:60558616 dw:60558616 dr:0 al:0 bm:3697 lo:0 pe:0 ua:0 ap:0
        [===================>] sync'ed:100.0% (0/59139)M
        finish: 0:00:00 speed: 16 (10,480) K/sec

/var/log/messages on Proc1:
...
Nov 25 11:07:02 Proc1 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc1 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc1 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc1 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc1 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc1 kernel: drbd0: 248 MB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc1 kernel: drbd0: Found 4 transactions (64 active extents) in activity log.
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1094]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc1 kernel: drbd0: drbdsetup [1096]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc1 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc1 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc1 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc1 kernel: drbd0: I am(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: Peer(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFReportParams --> WFBitMapS
Nov 25 11:07:03 Proc1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:03 Proc1 kernel: drbd0: drbd0_receiver [1097]: cstate WFBitMapS --> SyncSource
Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).
Nov 25 11:07:18 Proc1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
Nov 25 11:15:38 Proc1 kernel: drbd0: [drbd0_worker/1095] sock_sendmsg time expired, ko = 4294967295

/var/log/messages on Proc2:
...
Nov 25 11:07:02 Proc2 kernel: drbd: initialised. Version: 0.7.4 (api:76/proto:74)
Nov 25 11:07:02 Proc2 kernel: drbd: SVN Revision: 1539 build by lmb at chip, 2004-09-14 10:21:07
Nov 25 11:07:02 Proc2 kernel: drbd: registered as block device major 147
Nov 25 11:07:02 Proc2 kernel: drbd0: resync bitmap: bits=1251563 words=39112
Nov 25 11:07:02 Proc2 kernel: drbd0: size = 4888 MB (5006250 KB)
Nov 25 11:07:02 Proc2 kernel: drbd0: 80 KB marked out-of-sync by on disk bit-map.
Nov 25 11:07:02 Proc2 kernel: drbd0: Found 4 transactions (52 active extents) in activity log.
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1105]: cstate Unconfigured --> StandAlone
Nov 25 11:07:02 Proc2 kernel: drbd0: drbdsetup [1107]: cstate StandAlone --> Unconnected
Nov 25 11:07:02 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate Unconnected --> WFConnection
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFConnection --> WFReportParams
Nov 25 11:07:03 Proc2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Nov 25 11:07:03 Proc2 kernel: drbd0: resync bitmap: bits=16391168 words=512224
Nov 25 11:07:03 Proc2 kernel: drbd0: size = 62 GB (65564672 KB)
Nov 25 11:07:03 Proc2 kernel: drbd0: Connection established.
Nov 25 11:07:03 Proc2 kernel: drbd0: I am(S): 1:00000005:00000003:00000090:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: Peer(S): 1:00000005:00000003:00000091:0000004e:00
Nov 25 11:07:03 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFReportParams --> WFBitMapT
Nov 25 11:07:03 Proc2 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Nov 25 11:07:04 Proc2 kernel: drbd0: drbd0_receiver [1108]: cstate WFBitMapT --> SyncTarget
Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).
Nov 25 11:07:18 Proc2 kernel: drbd0: Secondary/Secondary --> Secondary/Primary

It is interesting to note that the nodes seem to have different ideas about how much data needs to be synchronized, i.e.:

Nov 25 11:07:03 Proc1 kernel: drbd0: Resync started as SyncSource (need to sync 60812372 KB [15203093 bits set]).

vs.

Nov 25 11:07:04 Proc2 kernel: drbd0: Resync started as SyncTarget (need to sync 60558500 KB [15139625 bits set]).

The nodes are connected with a gigabit crossover cable. The network itself works fine even after the sync halts. The sync rate is set to 30M, but I get the same result with 10M. Also, in my configuration DRBD runs on top of an LVM device.

Any ideas?
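
To put a number on that discrepancy, here is a small Python sketch that only redoes the arithmetic on the figures logged above. It is not part of DRBD. The variable names are made up for illustration, and the 4 KB-per-bit granularity is an assumption inferred from the logged KB and bit counts rather than taken from the DRBD source.

```python
# Figures copied from the "Resync started" log lines above.
source_kb, source_bits = 60812372, 15203093  # Proc1 (SyncSource)
target_kb, target_bits = 60558500, 15139625  # Proc2 (SyncTarget)

# Assumption: one bitmap bit covers 4 KB. The logged numbers agree with this.
KB_PER_BIT = 4
assert source_bits * KB_PER_BIT == source_kb
assert target_bits * KB_PER_BIT == target_kb

# How much more the source thinks it has to send than the target expects.
gap_kb = source_kb - target_kb
gap_mb = gap_kb / 1024

# Totals in whole MB, for comparison with the "(x/total)M" fields in /proc/drbd.
source_total_mb = source_kb // 1024
target_total_mb = target_kb // 1024

# Assumption: the logged "248 MB" on Proc1 is exactly 248 * 1024 KB (it may
# well be rounded). Under that assumption, subtracting Proc2's logged 80 KB
# reproduces the gap exactly.
ondisk_gap_kb = 248 * 1024 - 80

print(gap_kb, round(gap_mb), source_total_mb, target_total_mb)
print(ondisk_gap_kb == gap_kb)
```

The gap comes out at about 248 MB, which is the same amount Proc1 still reports as outstanding in /proc/drbd, while the two totals match the 59387 and 59139 shown there. So it looks as if the target received everything it expected while the source is still waiting on the part only it counted, though that is an interpretation of the numbers and not something the logs state.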
/Per