[DRBD-user] Frequent disconnect when doing backup.

Pascal Charest pascal.charest at labsphoenix.com
Sat Aug 27 22:52:13 CEST 2011


Hi,

I have a small issue with one of my DRBD setups. When my backup is running
(see below for setup and backup details), I'm getting these errors:

Aug 27 10:24:18 pig-two -- MARK --
Aug 27 10:27:26 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 27 10:27:26 pig-two kernel: drbd0: asender terminated
Aug 27 10:27:26 pig-two kernel: drbd0: Terminating asender thread
Aug 27 10:27:26 pig-two kernel: drbd0: sock was reset by peer
Aug 27 10:27:26 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=3064
sent=-32
Aug 27 10:27:26 pig-two kernel: drbd0: Creating new current UUID
Aug 27 10:27:26 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 10:27:26 pig-two kernel: drbd0: tl_clear()
Aug 27 10:27:26 pig-two kernel: drbd0: Connection closed
Aug 27 10:27:26 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
Aug 27 10:27:26 pig-two kernel: drbd0: receiver terminated
Aug 27 10:27:26 pig-two kernel: drbd0: receiver (re)started
Aug 27 10:27:26 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
Aug 27 10:27:27 pig-two kernel: drbd0: Handshake successful: Agreed network
protocol version 88
Aug 27 10:27:27 pig-two kernel: drbd0: Peer authenticated using 20 bytes of
'sha1' HMAC
Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams
)
Aug 27 10:27:27 pig-two kernel: drbd0: Starting asender thread (from
drbd0_receiver [3066])
Aug 27 10:27:27 pig-two kernel: drbd0: data-integrity-alg: md5
Aug 27 10:27:27 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk(
UpToDate -> Inconsistent )
Aug 27 10:27:27 pig-two kernel: drbd0: Began resync as SyncSource (will sync
2160 KB [540 bits set]).
Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 10:27:27 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0
sec; 2160 K/sec)
Aug 27 10:27:27 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk(
Inconsistent -> UpToDate )
Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 10:44:19 pig-two -- MARK --

and

Aug 27 11:04:19 pig-two -- MARK --
Aug 27 11:20:36 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=4096
sent=-104
Aug 27 11:20:37 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn(
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 27 11:20:37 pig-two kernel: drbd0: Creating new current UUID
Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 11:20:37 pig-two kernel: drbd0: asender terminated
Aug 27 11:20:37 pig-two kernel: drbd0: Terminating asender thread
Aug 27 11:20:37 pig-two kernel: drbd0: sock was shut down by peer
Aug 27 11:20:37 pig-two kernel: drbd0: tl_clear()
Aug 27 11:20:37 pig-two kernel: drbd0: Connection closed
Aug 27 11:20:37 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
Aug 27 11:20:37 pig-two kernel: drbd0: receiver terminated
Aug 27 11:20:37 pig-two kernel: drbd0: receiver (re)started
Aug 27 11:20:37 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
Aug 27 11:20:37 pig-two kernel: drbd0: Handshake successful: Agreed network
protocol version 88
Aug 27 11:20:37 pig-two kernel: drbd0: Peer authenticated using 20 bytes of
'sha1' HMAC
Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams
)
Aug 27 11:20:37 pig-two kernel: drbd0: Starting asender thread (from
drbd0_receiver [3066])
Aug 27 11:20:37 pig-two kernel: drbd0: data-integrity-alg: md5
Aug 27 11:20:37 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk(
UpToDate -> Inconsistent )
Aug 27 11:20:37 pig-two kernel: drbd0: Began resync as SyncSource (will sync
5788 KB [1447 bits set]).
Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 11:20:37 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0
sec; 5788 K/sec)
Aug 27 11:20:37 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk(
Inconsistent -> UpToDate )
Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
Aug 27 11:44:19 pig-two -- MARK --

Analysis: it looks like the network is failing, then everything reconnects,
resyncs and works again in under a second. There is no impact on
'production'. Does anyone have an idea why? Is it an error in my
setup/design (see below)?
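
To make the two incidents easier to compare, here is a minimal Python sketch that pulls the failed-send codes and resync sizes out of log excerpts like the ones above. It assumes the negative `sent=` values are negated Linux errno numbers (32 is EPIPE, 104 is ECONNRESET); the function name `summarize` and the shortened sample text are illustrative only, not part of any DRBD tooling.

```python
import re

# Linux errno values, hard-coded so the sketch behaves the same on any OS.
LINUX_ERRNO = {
    32: "EPIPE (broken pipe)",
    104: "ECONNRESET (connection reset by peer)",
}

# \s+ lets the patterns match across the line wraps seen in the mailed logs.
SEND_RE = re.compile(r"_drbd_send_page: size=(\d+) len=(\d+)\s+sent=(-?\d+)")
RESYNC_RE = re.compile(r"will sync\s+(\d+) KB \[(\d+) bits set\]")


def summarize(log_text):
    """Return (send_errors, resyncs) found in a DRBD kernel log excerpt."""
    send_errors = []
    for match in SEND_RE.finditer(log_text):
        sent = int(match.group(3))
        if sent < 0:
            # Assumption: a negative return value is -errno.
            send_errors.append(LINUX_ERRNO.get(-sent, f"errno {-sent}"))
    resyncs = [(int(kb), int(bits)) for kb, bits in RESYNC_RE.findall(log_text)]
    return send_errors, resyncs


# Shortened copy of the relevant lines from the two incidents.
sample = """\
drbd0: _drbd_send_page: size=4096 len=3064
sent=-32
drbd0: Began resync as SyncSource (will sync
2160 KB [540 bits set]).
drbd0: _drbd_send_page: size=4096 len=4096
sent=-104
drbd0: Began resync as SyncSource (will sync
5788 KB [1447 bits set]).
"""

errors, resyncs = summarize(sample)
for error in errors:
    print("send failed:", error)
for kb, bits in resyncs:
    # In these logs every set bitmap bit corresponds to 4 KB to resync.
    print(f"resync {kb} KB, {bits} bits, 4 KB per bit: {kb == bits * 4}")
```

Both codes mean the TCP connection was already gone when the send was attempted, which matches the "sock was reset by peer" and "sock was shut down by peer" messages; the sketch does not say why the peer dropped it.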


*Some background on the setup:*

It's an old version. Very old in fact: a roadmap to upgrade has been drafted
and submitted to the client. I'm just wondering about the specific issue
here... I want to be sure it's not an infrastructure design problem.
pig-two:~# cat /proc/drbd
version: 8.2.6 (api:88/proto:86-88)
GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at pig-two,
2008-08-19 15:02:28
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 pe:0
ua:0 ap:0 oos:0
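
For anyone who wants to watch this state during the backup window, the following is a rough Python sketch that parses the single-device `/proc/drbd` layout shown above. It assumes the 8.2-era field names (`cs:`, `st:`, `ds:`) exactly as they appear in this output and only handles one device; `parse_proc_drbd` is a hypothetical helper, not a DRBD utility.

```python
import re

# Matches the per-device status line, e.g. " 0: cs:Connected st:Primary/Secondary ..."
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+st:(?P<st>\S+)\s+ds:(?P<ds>\S+)",
    re.MULTILINE,
)
# Matches counters such as "ns:650469968" or "oos:0".
COUNTER_RE = re.compile(r"\b([a-z]{2,3}):(\d+)")


def parse_proc_drbd(text):
    """Parse the first device found in /proc/drbd-style text (single-device sketch)."""
    match = STATUS_RE.search(text)
    if match is None:
        return None
    local_role, peer_role = match.group("st").split("/")
    local_disk, peer_disk = match.group("ds").split("/")
    # Counters follow the status line; the text may be wrapped over several lines.
    counters = {key: int(value) for key, value in COUNTER_RE.findall(text[match.end():])}
    return {
        "minor": int(match.group("minor")),
        "connection": match.group("cs"),
        "roles": (local_role, peer_role),
        "disks": (local_disk, peer_disk),
        "counters": counters,
    }


sample = """\
version: 8.2.6 (api:88/proto:86-88)
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 pe:0
ua:0 ap:0 oos:0
"""

status = parse_proc_drbd(sample)
healthy = (
    status["connection"] == "Connected"
    and status["disks"] == ("UpToDate", "UpToDate")
    and status["counters"]["oos"] == 0
)
print(status["connection"], status["roles"], "healthy:", healthy)
```

Polling this every few seconds while the backup runs would show whether `cs:` leaves `Connected` at the same moments the kernel log reports the failures.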

We are speaking of:
 -   4x SAS 15k drives in a hardware RAID-5 array (Dell PERC 5)... presented
to the OS as /dev/sda.
 -   /dev/sda is the back-end device for DRBD... presented to the OS as
/dev/drbd0
 -   /dev/drbd0 is the lone "physical volume" in a volume group (called SAN)
from which logical volumes are created. Those are NOT locally mounted.
 -   those logical volumes are exported with vblade (AoE protocol, layer 2)
to other physical systems (Xen dom0) where they are used as the back-end
device (/dev/etherd/e0.1) for the root volume of the virtual systems
Everything works fine, but when I do a backup, I follow this process:
 -  mount a CIFS-exported share over the network
 -  take an LV snapshot, mount it, and copy everything to the CIFS share.
 -  unmount the snapshot, delete it... repeat for all LVs.
 -  unmount the network share
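
The process above can be sketched as a dry-run command plan in Python. Everything specific here is an assumption for illustration: the share path, mount points, snapshot size, LV names and the use of `cp -a` are not taken from the actual backup script, and CIFS credentials are left out. The sketch only builds and prints the commands; it does not run them.

```python
import shlex


def backup_plan(vg, lvs, share, backup_mount="/mnt/backup",
                snap_mount="/mnt/snap", snap_size="1G"):
    """Return the ordered list of commands for one backup run (nothing is executed)."""
    commands = [["mount", "-t", "cifs", share, backup_mount]]
    for lv in lvs:
        snap = f"{lv}-snap"  # hypothetical snapshot naming scheme
        commands += [
            # Snapshot the LV so the copy sees a frozen, consistent image.
            ["lvcreate", "--snapshot", "--size", snap_size,
             "--name", snap, f"/dev/{vg}/{lv}"],
            ["mount", "-o", "ro", f"/dev/{vg}/{snap}", snap_mount],
            ["mkdir", "-p", f"{backup_mount}/{lv}"],
            ["cp", "-a", f"{snap_mount}/.", f"{backup_mount}/{lv}/"],
            ["umount", snap_mount],
            ["lvremove", "-f", f"/dev/{vg}/{snap}"],
        ]
    commands.append(["umount", backup_mount])
    return commands


# Hypothetical names: volume group SAN is from the post, the rest is made up.
plan = backup_plan("SAN", ["vm1-root", "vm2-root"], "//backuphost/share")
for command in plan:
    print(shlex.join(command))
```

Laying the steps out this way makes the I/O pattern visible: while a snapshot exists, every write to the origin LV triggers extra copy-on-write traffic on /dev/drbd0, on top of the bulk read for the copy and the CIFS transfer on the network.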

The backups are consistent and valid (tested)... What have I missed? Should
I move away from AoE to a Linux-based iSCSI?

P.

--
Pascal Charest - *Cutting-edge technology consultant*
https://www.labsphoenix.com