[DRBD-user] drbd not syncing

Wed Aug 10 01:33:45 CEST 2016

Hi,

I currently manage 12 drbd server pairs.  They were originally all built
out on Debian Squeeze (drbd 8.3.7), and I've been upgrading them to Debian
Wheezy (8.3.13).  So far I have 8 of the pairs fully upgraded, 3 that are
halfway upgraded (one node on Squeeze and one on Wheezy) and one pair that
is currently broken.  I mention all of the boxes that are not having
problems simply to indicate that for the most part I know what I am doing
as I've done it successfully numerous times, yet this one pair is
problematic and I cannot determine the cause.

For the problematic pair, we have boxes r01 (squeeze) and r02 (wheezy).
 r01 currently is serving data off of the drbd volume and it is working
just fine.  r02 was rebuilt and will not stay connected to r01.  Here are
the logs from r02:

Aug  9 16:09:44 r02 kernel: [12739.178449] block drbd0: receiver (re)started
Aug  9 16:09:44 r02 kernel: [12739.178454] block drbd0: conn( Unconnected -> WFConnection )
Aug  9 16:09:44 r02 kernel: [12739.912037] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r02 kernel: [12739.912048] block drbd0: conn( WFConnection -> WFReportParams )
Aug  9 16:09:44 r02 kernel: [12739.912074] block drbd0: Starting asender thread (from drbd0_receiver [3740])
Aug  9 16:09:44 r02 kernel: [12739.936681] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r02 kernel: [12739.936691] block drbd0: Considerable difference in lower level device sizes: 256503768s vs. 1344982880s
Aug  9 16:09:44 r02 kernel: [12739.942918] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r02 kernel: [12739.942923] block drbd0: self E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942928] block drbd0: peer E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r02 kernel: [12739.942933] block drbd0: uuid_compare()=-1 by rule 50
Aug  9 16:09:44 r02 kernel: [12739.942935] block drbd0: Becoming sync target due to disk states.
Aug  9 16:09:44 r02 kernel: [12739.942946] block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug  9 16:09:44 r02 kernel: [12740.099597] block drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug  9 16:09:44 r02 kernel: [12740.104324] block drbd0: updated sync uuid BF8D25FBE26085B0:0000000000000000:0000000000000000:0000000000000000
Aug  9 16:09:44 r02 kernel: [12740.104423] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
Aug  9 16:09:44 r02 kernel: [12740.106582] block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
Aug  9 16:09:44 r02 kernel: [12740.106591] block drbd0: conn( WFSyncUUID -> SyncTarget )
Aug  9 16:09:44 r02 kernel: [12740.106599] block drbd0: Began resync as SyncTarget (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r02 kernel: [12740.140796] block drbd0: meta connection shut down by peer.
Aug  9 16:09:44 r02 kernel: [12740.141304] block drbd0: sock was shut down by peer
Aug  9 16:09:44 r02 kernel: [12740.141309] block drbd0: peer( Primary -> Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Aug  9 16:09:44 r02 kernel: [12740.141316] block drbd0: short read expecting header on sock: r=0
Aug  9 16:09:44 r02 kernel: [12740.142235] block drbd0: asender terminated
Aug  9 16:09:44 r02 kernel: [12740.142238] block drbd0: Terminating drbd0_asender
Aug  9 16:09:44 r02 kernel: [12740.151561] block drbd0: bitmap WRITE of 979 pages took 2 jiffies
Aug  9 16:09:44 r02 kernel: [12740.151567] block drbd0: 122 GB (32062701 bits) marked out-of-sync by on disk bit-map.
Aug  9 16:09:44 r02 kernel: [12740.151580] block drbd0: Connection closed
Aug  9 16:09:44 r02 kernel: [12740.151586] block drbd0: conn( BrokenPipe -> Unconnected )
Aug  9 16:09:44 r02 kernel: [12740.151592] block drbd0: receiver terminated

This just repeats over and over again.
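
To make the loop easier to follow, here is a minimal sketch that pulls the connection-state transitions out of a pasted log.  The function name conn_transitions and the short sample text are only illustrative; the pattern assumes the "conn( A -> B )" wording seen in the kernel messages above.

```python
import re

# Matches DRBD connection-state transitions such as "conn( A -> B )".
# \s* also spans hard line wraps that a mail client may add to the logs.
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def conn_transitions(log_text):
    """Return (old_state, new_state) pairs in the order they appear."""
    return CONN_RE.findall(log_text)


# A few lines taken from the r02 log, including wrapped ones.
sample = """\
block drbd0: conn( Unconnected
-> WFConnection )
block drbd0: conn( WFSyncUUID ->
SyncTarget )
block drbd0: peer( Primary ->
Unknown ) conn( SyncTarget -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
"""

for old, new in conn_transitions(sample):
    print(f"{old} -> {new}")
```

Run against the full r02 log, this should show the connection climbing from Unconnected to SyncTarget and then dropping to BrokenPipe and back to Unconnected on every cycle.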

Here are the logs from r01:

Aug  9 16:09:44 r01 kernel: [3438273.766768] block drbd0: receiver (re)started
Aug  9 16:09:44 r01 kernel: [3438273.771898] block drbd0: conn( Unconnected -> WFConnection )
Aug  9 16:09:44 r01 kernel: [3438274.474411] block drbd0: Handshake successful: Agreed network protocol version 91
Aug  9 16:09:44 r01 kernel: [3438274.483299] block drbd0: conn( WFConnection -> WFReportParams )
Aug  9 16:09:44 r01 kernel: [3438274.490420] block drbd0: Starting asender thread (from drbd0_receiver [6366])
Aug  9 16:09:44 r01 kernel: [3438274.498900] block drbd0: data-integrity-alg: <not-used>
Aug  9 16:09:44 r01 kernel: [3438274.505166] block drbd0: Considerable difference in lower level device sizes: 1344982880s vs. 256503768s
Aug  9 16:09:44 r01 kernel: [3438274.516226] block drbd0: max_segment_size ( = BIO size ) = 65536
Aug  9 16:09:44 r01 kernel: [3438274.523385] block drbd0: drbd_sync_handshake:
Aug  9 16:09:44 r01 kernel: [3438274.528677] block drbd0: self E21F17F92705CD4F:E17D2EE7BC2C235F:1074ED292C876258:548AFBCD7D5C2C3B bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.541195] block drbd0: peer E17D2EE7BC2C235E:0000000000000000:0000000000000000:0000000000000000 bits:32062701 flags:0
Aug  9 16:09:44 r01 kernel: [3438274.553710] block drbd0: uuid_compare()=1 by rule 70
Aug  9 16:09:44 r01 kernel: [3438274.559677] block drbd0: Becoming sync source due to disk states.
Aug  9 16:09:44 r01 kernel: [3438274.566897] block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
Aug  9 16:09:44 r01 kernel: [3438274.666397] block drbd0: conn( WFBitMapS -> SyncSource )
Aug  9 16:09:44 r01 kernel: [3438274.672845] block drbd0: Began resync as SyncSource (will sync 128250804 KB [32062701 bits set]).
Aug  9 16:09:44 r01 kernel: [3438274.683196] block drbd0: /build/buildd-linux-2.6_2.6.32-48squeeze3-amd64-mcoLgp/linux-2.6-2.6.32/debian/build/source_amd64_none/drivers/block/drbd/drbd_receiver.c:1932: sector: 0s, size: 65536
Aug  9 16:09:45 r01 kernel: [3438274.702834] block drbd0: error receiving RSDataRequest, l: 24!
Aug  9 16:09:45 r01 kernel: [3438274.702837] block drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> ProtocolError )
Aug  9 16:09:45 r01 kernel: [3438274.703005] block drbd0: asender terminated
Aug  9 16:09:45 r01 kernel: [3438274.703009] block drbd0: Terminating drbd0_asender
Aug  9 16:09:45 r01 kernel: [3438274.711319] block drbd0: Connection closed
Aug  9 16:09:45 r01 kernel: [3438274.711323] block drbd0: conn( ProtocolError -> Unconnected )
Aug  9 16:09:45 r01 kernel: [3438274.711329] block drbd0: receiver terminated

Again, this just repeats over and over.
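
As a sanity check on the numbers in both logs, here is a small sketch that converts them into more familiar units.  It assumes the "s" suffix means 512-byte sectors and that each bitmap bit covers 4 KiB; the helper names are only illustrative.

```python
SECTOR_BYTES = 512   # assumption: "s" in the log means 512-byte sectors
BITMAP_BIT_KIB = 4   # assumption: one out-of-sync bitmap bit covers 4 KiB


def sectors_to_gib(sectors):
    """Convert a sector count to GiB."""
    return sectors * SECTOR_BYTES / 2**30


def bits_to_kib(bits):
    """Convert a count of set bitmap bits to KiB of data to resync."""
    return bits * BITMAP_BIT_KIB


# Figures copied from the "Considerable difference" and "Began resync" lines.
smaller_dev_sectors = 256503768
larger_dev_sectors = 1344982880
bits_set = 32062701

print(f"smaller backing device: {sectors_to_gib(smaller_dev_sectors):.1f} GiB")
print(f"larger backing device:  {sectors_to_gib(larger_dev_sectors):.1f} GiB")
print(f"resync size: {bits_to_kib(bits_set)} KiB")
```

Under those assumptions the resync size comes out to 128250804 KiB, which matches the "will sync 128250804 KB" figure both nodes report, and the smaller backing device is roughly the same 122 GB that the bitmap marks out-of-sync.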

The drbd syncing is done over dedicated 1 Gbps interfaces on both boxes
that are directly connected via a crossover cable.  SSH between the two
servers over that dedicated link works fine.  Pinging also works fine with
no packet loss even when I flood it with packets as fast as possible:

# ping -c 100 -i 0 -q 10.0.255.253
PING 10.0.255.253 (10.0.255.253) 56(84) bytes of data.

--- 10.0.255.253 ping statistics ---
100 packets transmitted, 100 received, 0% packet loss, time 15ms
rtt min/avg/max/mdev = 0.133/0.141/0.235/0.025 ms, ipg/ewma 0.156/0.137 ms

Here are the contents of my drbd.conf file (it is identical on both
systems):

global {
   usage-count no;
}

resource drbd0 {
  protocol C;
  handlers { pri-on-incon-degr "echo '!DRBD! pri on incon-degr' | wall ; exit 1"; }
  startup {
    degr-wfc-timeout 60;    # 1 minute.
    wfc-timeout 55;
  }

  disk {
    on-io-error   detach;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on r01 {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p3;
    address    10.0.255.253:7788;
    meta-disk  internal;
  }

  on r02 {
    device     /dev/drbd0;
    disk       /dev/cciss/c0d0p6;
    address    10.0.255.254:7788;
    meta-disk  internal;
  }
}

Now, this is not the first time this has happened.  When it occurred before,
I replaced the node that I had rebuilt with another server, and things
started working.  But I still have no real indication of why it stopped in
the first place or why replacing the server fixed the problem.

In this case, I've done the following:

1.  Wiped r02 and reinstalled/configured it a second time.
2.  Replaced r02 and installed/configured it from scratch.
3.  Replaced the crossover cable (twice).

None of those things helped.  The next thing I am going to do is replace
the r01 server and rebuild.  While I am hopeful that will fix the problem,
I am at a loss as to what is causing the problem and really want to
understand what is going on so that I can either avoid this problem in the
future or at least know what the root cause is.

If anyone can shed some light on this issue, it would be greatly
appreciated.

Thanks!

-- 
Michael Johnson - MJ