[DRBD-user] Frequent disconnect when doing backup.

Pascal Charest pascal.charest at labsphoenix.com
Tue Aug 30 14:08:17 CEST 2011


Hi,

I'm getting kind of ashamed here. I already mentioned it's an old setup, but,
yeah - the whole process is to get it updated to the latest kernel + upgrade the
hardware... I'm just curious about the specific issue, to know if it's a flaw
in the design.

So, the GNU/Linux distribution is Debian 4.0, running kernel
2.6.18-6-686-bigmem. The 'fileserver' doesn't have Xen installed, but I'm
pretty sure my raw devices exported over AoE are equivalent to your disks as
files on NFS from an I/O point of view. My exports go through a single 1 Gbps
Ethernet link with no bonding installed (yet).

P.


On Tue, Aug 30, 2011 at 6:30 AM, Martin Rusko <martin.rusko at gmail.com> wrote:

> Pascal,
>
> what is the kernel and distribution you're running there, please? I'm
> just curious, as I see somewhat similar behavior with two nodes
> running drbd, ocfs2, corosync+pacemaker and xen to host couple of
> virtual guests. As a proof-of-concept, I have some guests having disks
> as files on NFS mounted directory from external NFS server. If there
> is heavy IO in these virtual machines, I can observe very short drbd
> disconnections and also corosync complains about being paused for too
> long (up to 16 seconds! Normally it sends some traffic over the
> network 3 times per second). When corosync is paused for as long as
> those 16 seconds, that node gets "stonithed" by the remaining cluster
> members.
>
> My setup is Debian/Squeeze with packages from official repositories,
> with kernel 2.6.32-5-xen-amd64. I'm still running around like a headless
> chicken, trying different things, right now running a kernel with
> CONFIG_PREEMPT=y or maybe a different kernel version. With some
> experience in Linux kernel tracing, maybe it would be possible to find out
> what blocks execution of the drbd or corosync processes, making them start
> failing.
>
> Best Regards,
> Martin
>
>
>
> On Sun, Aug 28, 2011 at 3:59 PM, Pascal Charest
> <pascal.charest at labsphoenix.com> wrote:
> > Hi,
> > It always `worked` - it doesn't crash. Only the communication seems to get
> > interrupted for a few seconds while backups are being taken. Backups are
> > valid and the setup can survive a few seconds where redundancy is not
> > available.
> > I should have asked that question when I built the setup 4 years ago,
> > but... yeah... and now I'm trying to fix everything up for that client.
> > The broken communication seems to happen only when I'm mounting the
> > backup snapshot and taking a RAR from it. Might be a problem on the AoE
> > side of things along with an LVM snapshot.
> >
> > P.
> >
> > On Sun, Aug 28, 2011 at 9:18 AM, Pascal BERTON <pascal.berton3 at free.fr>
> > wrote:
> >>
> >> Pascal,
> >>
> >>
> >>
> >> One thing is unclear: did it use to work in the past (and if yes, what
> >> has changed lately that could explain this behavior), or is it a new
> >> feature you've just added to your customer's config?
> >>
> >> Furthermore, I suspect you have scripted this whole process, haven't you?
> >> If so, have you identified which step induces this communication
> >> disruption? Have you tried to execute this sequence manually, and if so,
> >> at what step does it happen?
> >>
> >>
> >>
> >> Best regards,
> >>
> >>
> >>
> >> Pascal.
> >>
> >>
> >>
> >> From: drbd-user-bounces at lists.linbit.com
> >> [mailto:drbd-user-bounces at lists.linbit.com] On Behalf Of Pascal
> >> Charest
> >> Sent: Saturday, 27 August 2011 22:52
> >> To: drbd-user at lists.linbit.com
> >> Subject: [DRBD-user] Frequent disconnect when doing backup.
> >>
> >>
> >>
> >> Hi,
> >>
> >>
> >>
> >> I have a small issue with one of my DRBD setups. When my backup is
> >> running (see below for setup and backup details), I'm getting these errors:
> >>
> >>
> >>
> >> Aug 27 10:24:18 pig-two -- MARK --
> >> Aug 27 10:27:26 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >> Aug 27 10:27:26 pig-two kernel: drbd0: asender terminated
> >> Aug 27 10:27:26 pig-two kernel: drbd0: Terminating asender thread
> >> Aug 27 10:27:26 pig-two kernel: drbd0: sock was reset by peer
> >> Aug 27 10:27:26 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=3064 sent=-32
> >> Aug 27 10:27:26 pig-two kernel: drbd0: Creating new current UUID
> >> Aug 27 10:27:26 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 10:27:26 pig-two kernel: drbd0: tl_clear()
> >> Aug 27 10:27:26 pig-two kernel: drbd0: Connection closed
> >> Aug 27 10:27:26 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
> >> Aug 27 10:27:26 pig-two kernel: drbd0: receiver terminated
> >> Aug 27 10:27:26 pig-two kernel: drbd0: receiver (re)started
> >> Aug 27 10:27:26 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Handshake successful: Agreed network protocol version 88
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
> >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams )
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Starting asender thread (from drbd0_receiver [3066])
> >> Aug 27 10:27:27 pig-two kernel: drbd0: data-integrity-alg: md5
> >> Aug 27 10:27:27 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Began resync as SyncSource (will sync 2160 KB [540 bits set]).
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 2160 K/sec)
> >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 10:44:19 pig-two -- MARK --
> >>
> >>
> >>
> >> and
> >>
> >>
> >>
> >> Aug 27 11:04:19 pig-two -- MARK --
> >> Aug 27 11:20:36 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=4096 sent=-104
> >> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Creating new current UUID
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 11:20:37 pig-two kernel: drbd0: asender terminated
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Terminating asender thread
> >> Aug 27 11:20:37 pig-two kernel: drbd0: sock was shut down by peer
> >> Aug 27 11:20:37 pig-two kernel: drbd0: tl_clear()
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Connection closed
> >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( NetworkFailure -> Unconnected )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: receiver terminated
> >> Aug 27 11:20:37 pig-two kernel: drbd0: receiver (re)started
> >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( Unconnected -> WFConnection )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Handshake successful: Agreed network protocol version 88
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Peer authenticated using 20 bytes of 'sha1' HMAC
> >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFConnection -> WFReportParams )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Starting asender thread (from drbd0_receiver [3066])
> >> Aug 27 11:20:37 pig-two kernel: drbd0: data-integrity-alg: md5
> >> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Began resync as SyncSource (will sync 5788 KB [1447 bits set]).
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Resync done (total 1 sec; paused 0 sec; 5788 K/sec)
> >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block now.
> >> Aug 27 11:44:19 pig-two -- MARK --
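
A side note on the two sent= values in these excerpts: by the usual kernel
convention a negative return value is a negated errno, so -32 and -104 would
read as EPIPE and ECONNRESET on Linux. That reading is an assumption (it has
not been checked against the 8.2.6 source), and the small lookup table and the
helper name decode_send_error below are purely illustrative. A minimal Python
sketch:

```python
import re

# Linux errno numbers, hard-coded so the example does not depend on the
# platform it runs on. Only the two values seen in the log are listed.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

SENT_RE = re.compile(r"_drbd_send_page: size=(\d+) len=(\d+) sent=(-?\d+)")


def decode_send_error(line):
    """Return (errno name, description) for a failed send line, else None."""
    match = SENT_RE.search(line)
    if match is None:
        return None
    sent = int(match.group(3))
    if sent >= 0:
        return None  # a non-negative value is a byte count, not an error
    # Assumption: a negative 'sent' is a negated errno.
    return LINUX_ERRNO.get(-sent, ("UNKNOWN", "errno %d" % -sent))


samples = [
    "Aug 27 10:27:26 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=3064 sent=-32",
    "Aug 27 11:20:36 pig-two kernel: drbd0: _drbd_send_page: size=4096 len=4096 sent=-104",
]
for sample in samples:
    print(decode_send_error(sample))
```

If that reading is right, both values would be consistent with the peer
closing or resetting the TCP connection first, which fits the "sock was reset
by peer" and "sock was shut down by peer" lines.
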
> >>
> >>
> >>
> >> Analysis: it looks like the network is failing, then everything - within
> >> about a second - reconnects, resyncs and works again. There is no impact on
> >> 'production'. Does anyone have an idea why? Is it an error in my
> >> setup/design (see below)?
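
To put numbers on that analysis, here is a rough Python sketch that reads
three lines from the first excerpt (unwrapped) and reports how long the
connection was away from Connected, plus whether the resync size matches the
bit count. The 4 KB-per-bit figure is an assumption about DRBD's bitmap
granularity, summarize is just a made-up helper name, and the year is taken
from the date of this thread. Syslog timestamps only have one-second
resolution, so the result is approximate.

```python
import re
from datetime import datetime

LOG = """\
Aug 27 10:27:26 pig-two kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 27 10:27:27 pig-two kernel: drbd0: Began resync as SyncSource (will sync 2160 KB [540 bits set]).
Aug 27 10:27:27 pig-two kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
"""

CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")
RESYNC_RE = re.compile(r"will sync (\d+) KB \[(\d+) bits set\]")
KB_PER_BIT = 4  # assumption: one bitmap bit covers 4 KB of storage


def summarize(log_text, year=2011):
    """Return the outage length in seconds and a resync size check."""
    lost = restored = None
    resync_kb = bits = None
    for line in log_text.splitlines():
        # Syslog lines carry no year, so one is supplied explicitly.
        stamp = datetime.strptime("%d %s" % (year, line[:15]),
                                  "%Y %b %d %H:%M:%S")
        conn = CONN_RE.search(line)
        if conn and conn.group(1) == "Connected" and lost is None:
            lost = stamp          # first transition away from Connected
        if conn and conn.group(2) == "Connected":
            restored = stamp      # latest transition back to Connected
        resync = RESYNC_RE.search(line)
        if resync:
            resync_kb, bits = int(resync.group(1)), int(resync.group(2))
    return {
        "outage_seconds": (restored - lost).total_seconds(),
        "resync_kb": resync_kb,
        "bits": bits,
        "size_matches_bits": resync_kb == bits * KB_PER_BIT,
    }


print(summarize(LOG))
```

For the first excerpt this gives a one-second window, and 540 bits x 4 KB =
2160 KB, so only about 2 MB had to be resent after the reconnect.
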
> >>
> >>
> >>
> >>
> >>
> >> Some background on the setup:
> >>
> >>
> >>
> >> It's an old version. Very old in fact - a roadmap to upgrade has been
> >> drafted and submitted to the client - I'm just wondering about the specific
> >> issue here... I want to be sure it's not an infrastructure design problem.
> >>
> >> pig-two:~# cat /proc/drbd
> >> version: 8.2.6 (api:88/proto:86-88)
> >> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at pig-two, 2008-08-19 15:02:28
> >>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
> >>     ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 pe:0 ua:0 ap:0 oos:0
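
For anyone scripting a health check around this output, the status block can
be picked apart with a small regex. A minimal Python sketch follows; the field
meanings given in the comments (ns = network send, dw = disk write, oos = out
of sync, all in KiB) are from memory of the DRBD documentation and should be
double-checked, and parse_status / looks_healthy are illustrative names only.

```python
import re

STATUS = (
    " 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---\n"
    "    ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 "
    "pe:0 ua:0 ap:0 oos:0\n"
)

# Two- or three-letter lowercase key, a colon, then a run of non-space text.
FIELD_RE = re.compile(r"\b([a-z]{2,3}):(\S+)")


def parse_status(text):
    """Turn 'key:value' pairs from /proc/drbd into a dict (ints where numeric)."""
    fields = {}
    for key, value in FIELD_RE.findall(text):
        fields[key] = int(value) if value.isdigit() else value
    return fields


def looks_healthy(fields):
    """Connected, both disks UpToDate, and nothing out of sync."""
    return (fields.get("cs") == "Connected"
            and fields.get("ds") == "UpToDate/UpToDate"
            and fields.get("oos") == 0)


fields = parse_status(STATUS)
print(fields["cs"], fields["st"], fields["ds"])
# ns is assumed to be KiB sent to the peer; convert to GiB for readability.
print("sent to peer: %.1f GiB" % (fields["ns"] / 1024.0 / 1024.0))
print("healthy:", looks_healthy(fields))
```
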
> >>
> >>
> >>
> >> We are speaking of:
> >>
> >>  -   4x SAS 15k drives in a hardware RAID-5 array (DELL
> >>      Perc5)... presented to the OS as /dev/sda.
> >>
> >>  -   /dev/sda is the back-end device for DRBD... presented to the OS as
> >>      /dev/drbd0
> >>
> >>  -   /dev/drbd0 is the lone "physical volume" in a volume group (called
> >>      SAN) from which logical volumes are created. Those are NOT locally
> >>      mounted.
> >>
> >>  -   those logical volumes are exported with vblade (AoE protocol, layer
> >>      2) to another physical system (Xen dom0) where they are used as the
> >>      backend device (/dev/etherd/e0.1) for the root volume of a virtual
> >>      system
> >>
> >>
> >>
> >> Everything works fine, but when I do backups, I follow this process:
> >>
> >>  -  mount a CIFS exported share over the network
> >>
> >>  -  take an LV snapshot, mount it, and copy everything to the CIFS share.
> >>
> >>  -  unmount the snapshot, delete it... repeat for all LVs.
> >>
> >>  -  unmount network share
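
The steps above can be sketched as a dry-run plan in Python. Only the volume
group name SAN comes from this setup; the LV name, the share path, the mount
points, the 2G snapshot size and the use of cp -a are placeholders chosen for
illustration (CIFS credentials are left out entirely). Nothing is executed:
the function only builds the command lines so they can be reviewed first.

```python
import shlex


def backup_plan(vg, lvs, share, snap_size="2G",
                snap_mount="/mnt/snap", share_mount="/mnt/backup"):
    """Return the backup steps as a list of argv lists, without running them."""
    plan = [["mount", "-t", "cifs", share, share_mount]]
    for lv in lvs:
        snap = lv + "-snap"                      # hypothetical snapshot name
        origin = "/dev/%s/%s" % (vg, lv)
        snap_dev = "/dev/%s/%s" % (vg, snap)
        target = "%s/%s" % (share_mount, lv)
        plan += [
            ["lvcreate", "--snapshot", "--size", snap_size,
             "--name", snap, origin],
            ["mount", "-o", "ro", snap_dev, snap_mount],
            ["mkdir", "-p", target],
            ["cp", "-a", snap_mount + "/.", target + "/"],
            ["umount", snap_mount],
            ["lvremove", "-f", snap_dev],
        ]
    plan.append(["umount", share_mount])
    return plan


# Placeholder LV name and share path, for illustration only.
for argv in backup_plan("SAN", ["vm1-root"], "//backup-host/backups"):
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Printing the plan before running it makes it easy to time each step by hand
and see which one coincides with the DRBD disconnect.
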
> >>
> >>
> >>
> >> The backups are consistent and valid (tested)... What have I missed?
> >> Should I move away from AoE to a Linux-based iSCSI?
> >>
> >>
> >>
> >> P.
> >>
> >>
> >>
> >> --
> >>
> >> Pascal Charest - Cutting-edge technology consultant
> >> https://www.labsphoenix.com
> >
> >
> > --
> > --
> > Pascal Charest - Cutting-edge technology consultant
> > Les Laboratoires Phoenix
> >
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> >
> >
>



-- 
--
Pascal Charest - Cutting-edge technology consultant
Les Laboratoires Phoenix <https://labsphoenix.com>