[DRBD-user] Sync stalled

Juan Antonio Cortés jancorg at gmail.com
Fri May 21 20:28:02 CEST 2010


Hi,

I've been trying to solve a strange issue with DRBD, but I can't work out what is causing it.

I have 6 servers (3 pairs) in the same datacenter, but one pair refuses to
work properly.
All of them have the same config, Linux distribution and kernel. drbd-utils comes
from the Debian lenny package (same version on all nodes).
The DRBD partitions are configured on top of LVM.

The two non-working nodes have the same network cards as the working nodes:
Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI
Express Gigabit Ethernet controller (rev 01)
Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI
Express Gigabit Ethernet controller (rev 03)

drbd.conf

skip {
}
global {
    usage-count no;
}
common {
  syncer {
    rate 33M;
#    verify-alg crc32c;
  }
}
resource r0 {
  protocol C;

  handlers {
#    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";
#    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; ifconfig eth0 down";
    local-io-error "/sbin/drbd-io-error";
#    pri-lost "echo pri-lost. Have a look at the log files. | mail -s 'DRBD Alert' root at almiralabs.com";
#    split-brain "echo split-brain. drbdadm -- --discard-my-data connect $DRBD_RESOURCE ? | mail -s 'DRBD Alert' root at almiralabs.com";
  }

  startup {
    wfc-timeout 20;
    degr-wfc-timeout 10;
  }
  disk {
    on-io-error pass_on;
#    fencing resource-and-stonith;
  }

  net {
     after-sb-0pri discard-younger-primary;
     after-sb-1pri violently-as0p;
     after-sb-2pri violently-as0p;
#    rr-conflict disconnect;

#     sndbuf-size 512k;
#     max-buffers     2048;
#     max-epoch-size  2048;
#    ko-count 4;
     cram-hmac-alg "sha1";
     shared-secret "xxxxx";
  }
  syncer {
    rate 33M;
    al-extents 257;
#    verify-alg crc32c;
  }

  on s5 {
    device     /dev/drbd0;
#    disk       /dev/md4;
    disk       /dev/vg404/readhatAS5-hd2;
    address    123.123.123.123:7788;
    flexible-meta-disk  internal;
  }

  on s6 {
    device     /dev/drbd0;
#    disk       /dev/md4;
    disk       /dev/vg404/readhatAS5-hd2;
    address    312.312.312.321:7788;
    flexible-meta-disk  internal;
  }
}


/proc/drbd

version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root at ns360164.ovh.net, 2009-09-29 12:02:15
 0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164
        [>....................] sync'ed:  0.1% (51196/51196)M
        finish: 546:06:58 speed: 0 (0) K/sec
s6:~#  cat /proc/drbd
version: 8.3.3rc2 (api:88/proto:86-91)
GIT-hash: 04b2f175d7076ef2e0dd7d5ba6f6843357a041ed build by root at ns360164.ovh.net, 2009-09-29 12:02:15
 0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164
        [>....................] sync'ed:  0.1% (51196/51196)M
        stalled
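
For anyone who wants to compare these fields across nodes, here is a minimal Python sketch that splits a /proc/drbd device entry into its `key:value` pairs. The function name `parse_drbd_status` is hypothetical (it is not part of drbd-utils), and the sample text is copied from the output above with the wrapped status line joined back together.

```python
import re

def parse_drbd_status(text):
    """Return a dict of the key:value fields found in one /proc/drbd device entry."""
    # Fields look like cs:SyncSource, ro:Secondary/Secondary, oos:52427164.
    # Keys are 2-3 lowercase letters; values run up to the next whitespace.
    return dict(re.findall(r"\b([a-z]{2,3}):(\S+)", text))

# Device entry as shown above (status line unwrapped).
sample = (
    " 0: cs:SyncSource ro:Secondary/Secondary ds:Inconsistent/Inconsistent C r----\n"
    "    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:52427164\n"
)

fields = parse_drbd_status(sample)
print(fields["cs"], fields["ro"], fields["ds"], fields["oos"])
```

Feed it only the device lines; the `version:` header line contains `api:88/proto:86-91`, which this simple pattern would also pick up.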


May 21 18:19:04 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [17029])
May 21 18:19:04 s6 kernel: block drbd0: receiver (re)started
May 21 18:19:04 s6 kernel: block drbd0: conn( Unconnected -> WFConnection )
May 21 18:19:04 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
May 21 18:19:04 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May 21 18:19:04 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 21 18:19:04 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [17095])
May 21 18:19:04 s6 kernel: block drbd0: data-integrity-alg: <not-used>
May 21 18:19:04 s6 kernel: block drbd0: drbd_sync_handshake:
May 21 18:19:04 s6 kernel: block drbd0: self 99DAB77B149D94CC:0000000000000000:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 18:19:04 s6 kernel: block drbd0: peer DC7E66E5C2D873B7:99DAB77B149D94CD:0000000000000004:0000000000000000 bits:13106791 flags:0
May 21 18:19:04 s6 kernel: block drbd0: uuid_compare()=-1 by rule 50
May 21 18:19:04 s6 kernel: block drbd0: Becoming sync target due to disk states.
May 21 18:19:04 s6 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 21 18:19:05 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May 21 18:19:05 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)
May 21 18:19:05 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.
May 21 18:19:05 s6 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )
May 21 18:19:05 s6 kernel: block drbd0: asender terminated
May 21 18:19:05 s6 kernel: block drbd0: Terminating asender thread
May 21 18:19:05 s6 kernel: block drbd0: Connection closed
May 21 18:19:05 s6 kernel: block drbd0: conn( Disconnecting -> StandAlone )
May 21 18:19:05 s6 kernel: block drbd0: receiver terminated
May 21 18:19:05 s6 kernel: block drbd0: Terminating receiver thread
May 21 18:19:13 s6 kernel: block drbd0: disk( Inconsistent -> Diskless )
May 21 18:19:13 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 0
May 21 18:19:13 s6 kernel: block drbd0: worker terminated
May 21 18:19:13 s6 kernel: block drbd0: Terminating worker thread
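
The line that stands out in this log is the helper exiting with code 3 just before the connection is dropped. As a rough way to spot such lines in a longer log, here is a small Python sketch. The names `HELPER_RE` and `failed_helpers` are made up for illustration, and it assumes each syslog entry sits on a single line, as it does in the original log file.

```python
import re

# Matches "helper command: <cmd> <handler> <minor> exit code <n>" entries.
HELPER_RE = re.compile(
    r"helper command: (?P<cmd>\S+) (?P<handler>\S+) (?P<minor>\S+) "
    r"exit code (?P<code>\d+)"
)

def failed_helpers(lines):
    """Yield (handler, minor, exit_code) for helper calls that exited non-zero."""
    for line in lines:
        m = HELPER_RE.search(line)
        if m and int(m.group("code")) != 0:
            yield m.group("handler"), m.group("minor"), int(m.group("code"))

# Two entries taken from the log above.
log = [
    "May 21 18:19:05 s6 kernel: block drbd0: helper command: "
    "/sbin/drbdadm before-resync-target minor-0",
    "May 21 18:19:05 s6 kernel: block drbd0: helper command: "
    "/sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)",
]

result = list(failed_helpers(log))
print(result)
```

The first entry only announces that the helper is being started, so it is skipped; only the entry that carries an exit code is reported.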


Other times I can't see any sync progress at all, because it fails with this output:


May 21 17:44:22 s6 kernel: block drbd0: Starting worker thread (from cqueue [229])
May 21 17:44:22 s6 kernel: block drbd0: disk( Diskless -> Attaching )
May 21 17:44:22 s6 kernel: block drbd0: No usable activity log found.
May 21 17:44:22 s6 kernel: block drbd0: Method to ensure write ordering: barrier
May 21 17:44:22 s6 kernel: block drbd0: Backing device's merge_bvec_fn() = ffffffff8083cb50
May 21 17:44:22 s6 kernel: block drbd0: max_segment_size ( = BIO size ) = 4096
May 21 17:44:22 s6 kernel: block drbd0: drbd_bm_resize called with capacity == 104854328
May 21 17:44:22 s6 kernel: block drbd0: resync bitmap: bits=13106791 words=204794
May 21 17:44:22 s6 kernel: block drbd0: size = 50 GB (52427164 KB)
May 21 17:44:22 s6 kernel: block drbd0: Writing the whole bitmap, size changed
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:44:22 s6 kernel: block drbd0: recounting of set bits took additional 0 jiffies
May 21 17:44:22 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:44:22 s6 kernel: block drbd0: disk( Attaching -> Inconsistent )
May 21 17:45:30 s6 kernel: block drbd0: conn( StandAlone -> Unconnected )
May 21 17:45:30 s6 kernel: block drbd0: Starting receiver thread (from drbd0_worker [11346])
May 21 17:45:30 s6 kernel: block drbd0: receiver (re)started
May 21 17:45:30 s6 kernel: block drbd0: conn( Unconnected -> WFConnection )
May 21 17:45:31 s6 kernel: block drbd0: Handshake successful: Agreed network protocol version 91
May 21 17:45:31 s6 kernel: block drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
May 21 17:45:31 s6 kernel: block drbd0: conn( WFConnection -> WFReportParams )
May 21 17:45:31 s6 kernel: block drbd0: Starting asender thread (from drbd0_receiver [11772])
May 21 17:45:31 s6 kernel: block drbd0: data-integrity-alg: <not-used>
May 21 17:45:31 s6 kernel: block drbd0: drbd_sync_handshake:
May 21 17:45:31 s6 kernel: block drbd0: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 17:45:31 s6 kernel: block drbd0: peer EEA95651B1BD8559:0000000000000004:0000000000000000:0000000000000000 bits:13106791 flags:0
May 21 17:45:31 s6 kernel: block drbd0: uuid_compare()=-2 by rule 20
May 21 17:45:31 s6 kernel: block drbd0: Becoming sync target due to disk states.
May 21 17:45:31 s6 kernel: block drbd0: Writing the whole bitmap, full sync required after drbd_sync_handshake.
May 21 17:45:31 s6 kernel: block drbd0: 50 GB (13106791 bits) marked out-of-sync by on disk bit-map.
May 21 17:45:31 s6 kernel: block drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 21 17:45:31 s6 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
May 21 17:45:31 s6 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 3 (0x300)
May 21 17:45:31 s6 kernel: block drbd0: before-resync-target handler returned 3, dropping connection.
May 21 17:45:31 s6 kernel: block drbd0: peer( Primary -> Unknown ) conn( WFSyncUUID -> Disconnecting ) pdsk( UpToDate -> DUnknown )
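
As a sanity check on the sizes in this attach log, the figures do agree with each other: the whole device is marked out-of-sync, not a smaller or larger region. The short Python sketch below redoes the arithmetic. It assumes the `drbd_bm_resize` capacity is counted in 512-byte sectors and that one bitmap bit covers 4 KiB; both are assumptions on my part, although the numbers in the log are consistent with them.

```python
SECTOR_BYTES = 512    # assumed unit of the drbd_bm_resize capacity
BIT_BYTES = 4096      # assumed coverage of one resync bitmap bit (4 KiB)

capacity_sectors = 104854328   # "drbd_bm_resize called with capacity == 104854328"
bitmap_bits = 13106791         # "resync bitmap: bits=13106791"

# Device size derived from the sector count, in KiB.
size_kib = capacity_sectors * SECTOR_BYTES // 1024
# Amount marked out-of-sync derived from the bitmap, in KiB.
oos_kib = bitmap_bits * BIT_BYTES // 1024

print(size_kib, oos_kib)
```

Both values come out equal to the `size = 50 GB (52427164 KB)` line in the log and to the `oos:52427164` field in /proc/drbd.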


I'm a bit puzzled by this, because the other nodes work fine, so I don't know
whether it could be a network card or a disk failure.

Thanks!