[DRBD-user] Frequent disconnect when doing backup.

Pascal Charest pascal.charest at labsphoenix.com
Sun Aug 28 15:59:56 CEST 2011


Hi,

It has always `worked` - it doesn't crash. Only the communication seems to get
interrupted for a few seconds while backups are being taken. The backups are
valid, and the setup can survive a few seconds during which redundancy is not
available.
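
For what it's worth, here is a minimal Python sketch of how the length of
each interruption could be measured from the kernel messages quoted below.
It assumes each message sits on a single line, as in the syslog file itself
rather than the wrapped copy in this mail, and that the year has to be
supplied because syslog timestamps omit it. The function name
`outage_windows` is made up for illustration.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Aug 27 10:27:26 pig-two kernel: drbd0: ... conn( Connected -> NetworkFailure ) ...
CONN_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+drbd\d+:"
    r".*conn\(\s*(?P<old>\w+)\s*->\s*(?P<new>\w+)\s*\)"
)

def outage_windows(lines, year=2011):
    """Return (start, end) datetimes for each span between leaving the
    Connected state and returning to it."""
    windows = []
    start = None
    for line in lines:
        m = CONN_RE.search(line)
        if not m:
            continue  # not a connection-state transition
        ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        if m.group("old") == "Connected" and start is None:
            start = ts  # connection just dropped
        elif m.group("new") == "Connected" and start is not None:
            windows.append((start, ts))  # connection is back
            start = None
    return windows
```

Feeding it the lines of the syslog file and subtracting `start` from `end`
gives the outage length at syslog's one-second resolution.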

I should have asked that question when I built the setup 4 years ago, but...
yeah... and now I'm trying to fix everything up for that client.

The broken communication seems to happen only when I'm mounting the backup
snapshot and creating a RAR archive from it. It might be a problem on the AoE
side of things combined with an LVM snapshot.
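
To pin down which step is involved, one option is to have the backup script
record when each step begins and ends, then compare those times with the
disconnect timestamps from syslog. A rough Python sketch under those
assumptions follows; the function name `steps_during` and the
(name, begin, end) tuple layout are hypothetical, not something the current
script produces.

```python
from datetime import datetime

def steps_during(disconnects, steps):
    """Map each disconnect time to the backup steps that were running then.

    disconnects: iterable of datetime objects (e.g. taken from syslog).
    steps: iterable of (name, begin, end) tuples logged by the backup
           script; this layout is an assumption made for the sketch.
    """
    hits = {}
    for t in disconnects:
        hits[t] = [name for name, begin, end in steps if begin <= t <= end]
    return hits
```

A disconnect that maps to an empty list happened outside any recorded step,
which would point away from the backup itself.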


P.

On Sun, Aug 28, 2011 at 9:18 AM, Pascal BERTON <pascal.berton3 at free.fr> wrote:

> Pascal,
>
> One thing is unclear: did it used to work in the past (and if so, what has
> changed lately that could explain this behavior), or is it a new feature
> you've just added to your customer's config?
>
> Furthermore, I suspect you have scripted this whole process, haven't you? If
> so, have you identified which step induces this communication disruption?
> Have you tried executing the sequence manually, and if so, at what step does
> it happen?
>
> Best regards,
>
> Pascal.
>
> *From:* drbd-user-bounces at lists.linbit.com [mailto:
> drbd-user-bounces at lists.linbit.com] *On behalf of* Pascal Charest
> *Sent:* Saturday, August 27, 2011 22:52
> *To:* drbd-user at lists.linbit.com
> *Subject:* [DRBD-user] Frequent disconnect when doing backup.
>
> Hi,
>
> I have a small issue with one of my DRBD setups. When my backup is running
> (see below for setup and backup details), I'm getting these errors:
>
> Aug 27 10:24:18 pig-two -- MARK --
> Aug 27 10:27:26 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Aug 27 10:27:26 pig-two kernel: drbd0: asender terminated
> Aug 27 10:27:26 pig-two kernel: drbd0: Terminating asender thread
> Aug 27 10:27:26 pig-two kernel: drbd0: sock was reset by peer
> Aug 27 10:27:26 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=3064 sent=-32
> Aug 27 10:27:26 pig-two kernel: drbd0: Creating new current UUID
> Aug 27 10:27:26 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 10:27:26 pig-two kernel: drbd0: tl_clear()
> Aug 27 10:27:26 pig-two kernel: drbd0: Connection closed
> Aug 27 10:27:26 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Aug 27 10:27:26 pig-two kernel: drbd0: receiver terminated
> Aug 27 10:27:26 pig-two kernel: drbd0: receiver (re)started
> Aug 27 10:27:26 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
> Aug 27 10:27:27 pig-two kernel: drbd0: Handshake successful: Agreed network protocol version 88
> Aug 27 10:27:27 pig-two kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams )
> Aug 27 10:27:27 pig-two kernel: drbd0: Starting asender thread (from drbd0_receiver [3066])
> Aug 27 10:27:27 pig-two kernel: drbd0: data-integrity-alg: md5
> Aug 27 10:27:27 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
> Aug 27 10:27:27 pig-two kernel: drbd0: Began resync as SyncSource (will sync 2160 KB [540 bits set]).
> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 10:27:27 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 2160 K/sec)
> Aug 27 10:27:27 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 10:44:19 pig-two -- MARK --
>
> and
>
> Aug 27 11:04:19 pig-two -- MARK --
> Aug 27 11:20:36 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=4096 sent=-104
> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Aug 27 11:20:37 pig-two kernel: drbd0: Creating new current UUID
> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 11:20:37 pig-two kernel: drbd0: asender terminated
> Aug 27 11:20:37 pig-two kernel: drbd0: Terminating asender thread
> Aug 27 11:20:37 pig-two kernel: drbd0: sock was shut down by peer
> Aug 27 11:20:37 pig-two kernel: drbd0: tl_clear()
> Aug 27 11:20:37 pig-two kernel: drbd0: Connection closed
> Aug 27 11:20:37 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
> Aug 27 11:20:37 pig-two kernel: drbd0: receiver terminated
> Aug 27 11:20:37 pig-two kernel: drbd0: receiver (re)started
> Aug 27 11:20:37 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
> Aug 27 11:20:37 pig-two kernel: drbd0: Handshake successful: Agreed network protocol version 88
> Aug 27 11:20:37 pig-two kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams )
> Aug 27 11:20:37 pig-two kernel: drbd0: Starting asender thread (from drbd0_receiver [3066])
> Aug 27 11:20:37 pig-two kernel: drbd0: data-integrity-alg: md5
> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
> Aug 27 11:20:37 pig-two kernel: drbd0: Began resync as SyncSource (will sync 5788 KB [1447 bits set]).
> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 11:20:37 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 5788 K/sec)
> Aug 27 11:20:37 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> Aug 27 11:44:19 pig-two -- MARK --
>
> Analysis: it looks like the network is failing, then everything reconnects,
> resyncs and works again in under a second. There is no impact on
> 'production'. Does anyone have an idea why? Is it an error in my
> setup/design (see below)?
>
> *Some background on the setup:*
>
> It's an old version. Very old, in fact. A roadmap to upgrade has been drafted
> and submitted to the client; I'm just wondering about the specific issue
> here... I want to be sure it's not an infrastructure design problem.
>
> pig-two:~# cat /proc/drbd
> version: 8.2.6 (api:88/proto:86-88)
> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at pig-two, 2008-08-19 15:02:28
>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>     ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 pe:0 ua:0 ap:0 oos:0
>
> We are talking about:
>
>  -   4x SAS 15k drives in a hardware RAID-5 array (Dell PERC 5), presented
>      to the OS as /dev/sda.
>
>  -   /dev/sda is the back-end device for DRBD, presented to the OS as
>      /dev/drbd0.
>
>  -   /dev/drbd0 is the lone "physical volume" in a volume group (called SAN)
>      from which logical volumes are created. Those are NOT locally mounted.
>
>  -   Those logical volumes are exported with vblade (AoE protocol, layer 2)
>      to some other physical system (Xen dom0), where they are used as the
>      backend device (/dev/etherd/e0.1) for the root volume of a virtual system.
>
> Everything works fine, but when I do backups, I follow this process:
>
>  -  mount a CIFS-exported share over the network
>
>  -  take an LV snapshot, mount it, and copy everything to the CIFS share
>
>  -  unmount the snapshot and delete it; repeat for every LV
>
>  -  unmount the network share
>
> The backups are consistent and valid (tested)... What have I missed?
> Should I move away from AoE to a Linux-based iSCSI?
>
> P.
>
> --
>
> Pascal Charest - *Cutting-edge technology consultant*
> https://www.labsphoenix.com
>



-- 
Pascal Charest - *Cutting-edge technology consultant*
Les Laboratoires Phoenix <https://labsphoenix.com>