[DRBD-user] sync stalled

Fri Aug 31 23:07:48 CEST 2007

Hello,

When I try to sync the two nodes, the sync stalls indefinitely.  I must 
be missing something (something trivial, I hope).  These are the 
commands I run on node1 to initiate the sync:

drbdadm -- --do-what-I-say primary all
drbdadm -- connect all

Here are the contents of my drbd.conf:

resource r0 {

protocol C;
incon-degr-cmd "halt -f";

startup {
        degr-wfc-timeout 120; # 2 minutes
}

disk {
        on-io-error     detach;
}

net {

}

syncer {

        rate 10M;

        group 1;

        al-extents 257;
}

on nfs1 {
        device /dev/drbd0;
        disk /dev/sda4;
        address 10.5.7.25:7788;
        meta-disk /dev/sda3[0];
}

on nfs2 {
        device /dev/drbd0;
        disk /dev/sda4;
        address 10.5.7.26:7788;
        meta-disk /dev/sda3[0];
}

}

Here is the output from my syslog on the primary node:

Aug 30 14:10:52 nfs1 kernel: drbd: module not supported by Novell, setting U taint flag.
Aug 30 14:10:52 nfs1 kernel: drbd: initialised. Version: 0.7.18 (api:78/proto:74)
Aug 30 14:10:52 nfs1 kernel: drbd: SVN Revision: 2186 build by lmb at chip, 2006-05-04 17:08:27
Aug 30 14:10:52 nfs1 kernel: drbd: registered as block device major 147
Aug 30 14:10:52 nfs1 kernel: drbd0: resync bitmap: bits=59215590 words=1850488
Aug 30 14:10:52 nfs1 kernel: drbd0: size = 225 GB (236862360 KB)
Aug 30 14:10:52 nfs1 kernel: klogd 1.4.1, ---------- state change ----------
Aug 30 14:10:53 nfs1 kernel: drbd0: 225 GB marked out-of-sync by on disk bit-map.
Aug 30 14:10:53 nfs1 kernel: drbd0: No usable activity log found.
Aug 30 14:10:53 nfs1 kernel: drbd0: Marked additional 0 KB as out-of-sync based on AL.
Aug 30 14:10:53 nfs1 kernel: drbd0: drbdsetup [4816]: cstate Unconfigured --> StandAlone
Aug 30 14:10:53 nfs1 kernel: drbd0: drbdsetup [4829]: cstate StandAlone --> Unconnected
Aug 30 14:10:53 nfs1 kernel: drbd0: drbd0_receiver [4830]: cstate Unconnected --> WFConnection
Aug 30 14:10:53 nfs1 kernel: drbd0: using degr_wfc_timeout=120 seconds
Aug 30 14:10:56 nfs1 kernel: drbd0: drbd0_receiver [4830]: cstate WFConnection --> WFReportParams
Aug 30 14:10:56 nfs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Aug 30 14:10:56 nfs1 kernel: drbd0: Connection established.
Aug 30 14:10:56 nfs1 kernel: drbd0: I am(S): 1:00000002:00000001:00000007:00000001:00
Aug 30 14:10:56 nfs1 kernel: drbd0: Peer(S): 0:00000002:00000001:00000005:00000001:00
Aug 30 14:10:56 nfs1 kernel: drbd0: drbd0_receiver [4830]: cstate WFReportParams --> WFBitMapS
Aug 30 14:10:56 nfs1 kernel: drbd0: Secondary/Unknown --> Secondary/Secondary
Aug 30 14:10:56 nfs1 kernel: drbd0: drbd0_receiver [4830]: cstate WFBitMapS --> SyncSource
Aug 30 14:10:56 nfs1 kernel: drbd0: Resync started as SyncSource (need to sync 236845976 KB [59211494 bits set]).
Aug 30 14:13:35 nfs1 kernel: drbd0: Secondary/Secondary --> Primary/Secondary
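
As a sanity check, the figures in this log appear to be self-consistent 
if each bitmap bit tracks 4 KB of the device and bitmap words are 32 
bits wide.  Both of those are my assumptions, inferred only from the 
numbers above.  The sketch below redoes the arithmetic and also 
estimates how long a full sync would take if the "rate 10M" setting 
means 10 MiB per second (again an assumption); that works out to 
roughly 6.4 hours.

```python
# Arithmetic check on the resync figures from the syslog above.
# Assumptions (not confirmed by the log itself):
#   - one bitmap bit covers 4 KB of the device
#   - bitmap words are 32 bits wide
#   - syncer "rate 10M" means 10 MiB per second

BLOCK_KB = 4
BITS_PER_WORD = 32

bits_total = 59215590      # "resync bitmap: bits=59215590"
bits_set = 59211494        # "[59211494 bits set]"
rate_kb_per_s = 10 * 1024  # syncer rate, assumed 10 MiB/s

device_kb = bits_total * BLOCK_KB                        # device size in KB
to_sync_kb = bits_set * BLOCK_KB                         # amount left to sync in KB
to_sync_mb = to_sync_kb // 1024                          # same amount in whole MB
words = (bits_total + BITS_PER_WORD - 1) // BITS_PER_WORD  # bitmap words, rounded up
eta_hours = to_sync_kb / rate_kb_per_s / 3600            # best-case full sync time

print("device size (KB):", device_kb)
print("to sync (KB):", to_sync_kb)
print("to sync (MB):", to_sync_mb)
print("bitmap words:", words)
print("estimated full sync (hours): %.1f" % eta_hours)
```

The computed values match the log (236862360 KB, 236845976 KB, 1850488 
words), and the megabyte figure matches the total shown in /proc/drbd 
below.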

Here is the output from cat /proc/drbd on node1 and node2 respectively:

version: 0.7.18 (api:78/proto:74)
SVN Revision: 2186 build by lmb at chip, 2006-05-04 17:08:27
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:274688 nr:0 dw:0 dr:274688 al:0 bm:16 lo:0 pe:0 ua:0 ap:0
        [>...................] sync'ed:  0.2% (231026/231294)M
        stalled

version: 0.7.18 (api:78/proto:74)
SVN Revision: 2186 build by lmb at chip, 2006-05-04 17:08:27
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:274688 dw:274688 dr:0 al:0 bm:16 lo:0 pe:256 ua:0 ap:0
        [>...................] sync'ed:  0.2% (231026/231294)M
        stalled
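
To compare the two status dumps more easily, here is a minimal sketch 
that pulls the two-letter key:value fields out of the /proc/drbd text 
above and lists the numeric counters that differ between the nodes.  
The function names are my own and purely illustrative, and the parsing 
assumes only the layout shown in this post.

```python
import re

# /proc/drbd excerpts copied from the two nodes above.
NODE1 = """\
 0: cs:SyncSource st:Primary/Secondary ld:Consistent
    ns:274688 nr:0 dw:0 dr:274688 al:0 bm:16 lo:0 pe:0 ua:0 ap:0
"""
NODE2 = """\
 0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
    ns:0 nr:274688 dw:274688 dr:0 al:0 bm:16 lo:0 pe:256 ua:0 ap:0
"""


def parse_status(text):
    """Return the two-letter key:value fields as a dict (ints where possible)."""
    fields = {}
    for key, value in re.findall(r"\b([a-z]{2}):(\S+)", text):
        fields[key] = int(value) if value.isdigit() else value
    return fields


def differing_counters(a, b):
    """Return numeric fields whose values differ between two parsed dumps."""
    return {
        key: (a[key], b[key])
        for key in a
        if isinstance(a[key], int) and key in b and a[key] != b[key]
    }


node1 = parse_status(NODE1)
node2 = parse_status(NODE2)
diff = differing_counters(node1, node2)
print(diff)
```

On these two dumps, the send and receive counters mirror each other 
(ns on node1 equals nr on node2, dr on node1 equals dw on node2).  The 
one counter that is not mirrored is pe, which is 0 on node1 and 256 on 
node2.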

One more thing to note: after rebooting a few times and running the 
commands manually, I finally got the machines to sync.  However, when I 
repeat the test manually, the sync still stalls.  I wonder if I'm 
just not typing the right commands.  This is what I'm doing: when the 
nodes boot, I run the SLES 10 init script, which as far as I can tell 
runs modprobe drbd, drbdadm -d adjust, and then drbdadm wait_con_int.  
Am I right to assume that I should then run drbdadm primary all on the 
primary node?  And will that resync the nodes?  If so, why would it 
stall (just about every time)?

I apologize if these questions are extremely remedial.  I've scoured the 
web and the mail archives, but I can't seem to find the answers I'm 
looking for.

Any help would be greatly appreciated,

Matt