Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I currently manage 12 drbd server pairs. They were originally all built out on Debian Squeeze (drbd 8.3.7), and I've been upgrading them to Debian Wheezy (8.3.13). So far I have 8 of the paris fully upgraded, 3 that are halfway upgraded (one node on Squeeze and one on Wheezy) and one pair that is currently broken. I mention all of the boxes that are not having problems simply to indicate that for the most part I know what I am doing as I've done it successfully numerous times, yet this one pair is problematic and I cannot determine the cause. For the problematic pair, we have boxes r01 (squeeze) and r02 (wheezy). r01 currently is serving data off of the drbd volume and it is working just fine. r02 was rebuilt and will not stay connected to r01. Here are the logs from r02: Aug 9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started Aug 9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection ) Aug 9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91 Aug 9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams ) Aug 9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740]) Aug 9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used> Aug 9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s Aug 9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake: Aug 9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0 Aug 9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0 Aug 9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50 Aug 9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states. Aug 9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) Aug 9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID ) Aug 9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000 Aug 9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 Aug 9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0) Aug 9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget ) Aug 9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]). Aug 9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer. Aug 9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer Aug 9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) Aug 9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0 Aug 9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated Aug 9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender Aug 9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies Aug 9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map. Aug 9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed Aug 9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected ) Aug 9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated This just repeats over and over again. Here are the logs from r01: Aug 9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started Aug 9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection ) Aug 9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91 Aug 9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams ) Aug 9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366]) Aug 9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used> Aug 9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s Aug 9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536 Aug 9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake: Aug 9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0 Aug 9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0 Aug 9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70 Aug 9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states. Aug 9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) Aug 9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource ) Aug 9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]). Aug 9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536 Aug 9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24! Aug 9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError ) Aug 9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated Aug 9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender Aug 9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed Aug 9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected ) Aug 9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated Again, this just repeats over and over. The drbd syncing is done over a dedicated 1gbps interfaces on both boxes that are directly connected via a crossover cable. sshing between the two servers over that dedicated link works fine. Pinging also works fine with no packet loss even when I flood it with packets as fast as possible: # ping -c 100 -i 0 -q 10.0.255.253 PING 10.0.255.253 (10.0.255.253) 56(84) bytes of data. --- 10.0.255.253 ping statistics --- 100 packets transmitted, 100 received, 0% packet loss, time 15ms rtt min/avg/max/mdev = 0.133/0.141/0.235/0.025 ms, ipg/ewma 0.156/0.137 ms Here are the contents of my drbd.conf file (it is identical on both systems): global { usage-count no; } resource drbd0 { protocol C; handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; } startup { degr-wfc-timeout 60; # 1 minute. wfc-timeout 55; } disk { on-io-error detach; } syncer { rate 100M; al-extents 257; } on r01 { device /dev/drbd0; disk /dev/cciss/c0d0p3; address 10.0.255.253:7788; meta-disk internal; } on r02 { device /dev/drbd0; disk /dev/cciss/c0d0p6; address 10.0.255.254:7788; meta-disk internal; } } Now, this is not the first time this has happened, and when it has occured, I've replaced node that I rebuilt with another server, and things start working. But still no realy indication of why it stop in the first place or why replacing the sever fixed the problem. In this case, I've done the following: 1. Wipe r02 and reinstall/configure a second time. 2. Replace r02 and install/configure from scratch. 3. Replace the crossover cable (twice). None of those things helped. The next thing I am going to do is replace the r01 server and rebuild. While I am hopeful that will fix the problem, I am at a loss as to what is causing the problem and really want to understand what is going on so that I can either avoid this problem in the future or at least know what the root cause is. If anyone can shed some light on this issue, it would be greatly appreciated. Thanks! -- Michael Johnson - MJ -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20160809/b634cd3c/attachment.htm>