Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I'm getting kind of ashamed here. I already mentioned its an old setup, but, yeah - the whole process is to get it updated to latest kernel + upgrade the hardware... I'm just curious about the specific issue, to know if its a flaw in the design. So, GNU/Linux distribution is Debian-4.0, running under kernel 2.6.18-6-686-bigmem. The 'fileserver' doesn't have XEN installed, but I'm pretty sure my exported raw device from AoE are equivalent to your disks as files on NFS from I/O point of view. My export are through a single ethernet 1gbps link with no bonding installed (yet). P. On Tue, Aug 30, 2011 at 6:30 AM, Martin Rusko <martin.rusko at gmail.com>wrote: > Pascal, > > what is the kernel and distribution you're running there, please? I'm > just curious, as I see somewhat similar behavior with two nodes > running drbd, ocfs2, corosync+pacemaker and xen to host couple of > virtual guests. As a proof-of-concept, I have some guests having disks > as files on NFS mounted directory from external NFS server. If there > is heavy IO in these virtual machines, I can observe very short drbd > disconnections and also corosync complains about being paused for two > long (up to 16seconds!, normally it sends some traffic over the > network 3 times per second). When corosync is paused for as long as > those 16 seconds, that node gets "stonithed" by remaining cluster > members. > > My setup is Debian/Squeeze with packages from official repositories, > with kernel 2.6.32-5-xen-amd64. I'm still running around like headless > chicken, trying different things, right now to run kernel with > CONFIG_PREEMPT=y or maybe a different kernel version. Having some > experience with linux kernel tracing, maybe it would be possible what > blocks execution of drbd or corosync processes making them to start > failing. > > Best Regards, > Martin > > > > On Sun, Aug 28, 2011 at 3:59 PM, Pascal Charest > <pascal.charest at labsphoenix.com> wrote: > > Hi, > > It always `worked` - it doesn't crash. Only the communication seem to get > > interrupted for a few seconds while backup are being taken. Backup are > valid > > and the setup can survive with a few seconds where redundancy is not > > available. > > I should have asked that question when I build the setup 4 years ago, > but... > > yeah... and now I'm trying to fix everything up for that client. > > The broken communication seems to happen only when I'm mounting the > backup > > snapshot and taking RAR from it. Might be a problem on the AoE side of > > things along with a LVM snapshot. > > > > P. > > > > On Sun, Aug 28, 2011 at 9:18 AM, Pascal BERTON <pascal.berton3 at free.fr> > > wrote: > >> > >> Pascal, > >> > >> > >> > >> One thing is unclear : did it used to work in the past (and if yes what > >> has changed lately that could explain this behavior) or is it a new > feature > >> you’ve just added to your customer’s config ? > >> > >> Furthermore, I suspect you have scripted all this process haven’t you ? > If > >> so, have you identified which step induces this communication > disruption? > >> Have you tried to execute manually this sequence and then at what step > does > >> it happen ? > >> > >> > >> > >> Best regards, > >> > >> > >> > >> Pascal. > >> > >> > >> > >> De : drbd-user-bounces at lists.linbit.com > >> [mailto:drbd-user-bounces at lists.linbit.com] De la part de Pascal > Charest > >> Envoyé : samedi 27 août 2011 22:52 > >> À : drbd-user at lists.linbit.com > >> Objet : [DRBD-user] Frequent disconnect when doing backup. > >> > >> > >> > >> Hi, > >> > >> > >> > >> I have a small issue with one of my DRBD setup. When my backup is > running > >> (-see lower for setup and backup details), i`m getting those errors: > >> > >> > >> > >> Aug 27 10:24:18 pig-two -- MARK -- > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: peer( Secondary -> Unknown ) > conn( > >> Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: asender terminated > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: Terminating asender thread > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: sock was reset by peer > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: _drbd_send_page: size=4096 > len=3064 > >> sent=-32 > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: Creating new current UUID > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: tl_clear() > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: Connection closed > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: conn( NetworkFailure -> > Unconnected > >> ) > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: receiver terminated > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: receiver (re)started > >> > >> Aug 27 10:27:26 pig-two kernel: drbd0: conn( Unconnected -> WFConnection > ) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Handshake successful: Agreed > >> network protocol version 88 > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Peer authenticated using 20 bytes > >> of 'sha1' HMAC > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFConnection -> > >> WFReportParams ) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Starting asender thread (from > >> drbd0_receiver [3066]) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: data-integrity-alg: md5 > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: peer( Unknown -> Secondary ) > conn( > >> WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) > >> pdsk( UpToDate -> Inconsistent ) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Began resync as SyncSource (will > >> sync 2160 KB [540 bits set]). > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Resync done (total 1 sec; paused > 0 > >> sec; 2160 K/sec) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: conn( SyncSource -> Connected ) > >> pdsk( Inconsistent -> UpToDate ) > >> > >> Aug 27 10:27:27 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 10:44:19 pig-two -- MARK -- > >> > >> > >> > >> and > >> > >> > >> > >> Aug 27 11:04:19 pig-two -- MARK -- > >> > >> Aug 27 11:20:36 pig-two kernel: drbd0: _drbd_send_page: size=4096 > len=4096 > >> sent=-104 > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Secondary -> Unknown ) > conn( > >> Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Creating new current UUID > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: asender terminated > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Terminating asender thread > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: sock was shut down by peer > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: tl_clear() > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Connection closed > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( NetworkFailure -> > Unconnected > >> ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: receiver terminated > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: receiver (re)started > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( Unconnected -> WFConnection > ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Handshake successful: Agreed > >> network protocol version 88 > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Peer authenticated using 20 bytes > >> of 'sha1' HMAC > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFConnection -> > >> WFReportParams ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Starting asender thread (from > >> drbd0_receiver [3066]) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: data-integrity-alg: md5 > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: peer( Unknown -> Secondary ) > conn( > >> WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( WFBitMapS -> SyncSource ) > >> pdsk( UpToDate -> Inconsistent ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Began resync as SyncSource (will > >> sync 5788 KB [1447 bits set]). > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Resync done (total 1 sec; paused > 0 > >> sec; 5788 K/sec) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: conn( SyncSource -> Connected ) > >> pdsk( Inconsistent -> UpToDate ) > >> > >> Aug 27 11:20:37 pig-two kernel: drbd0: Writing meta data super block > now. > >> > >> Aug 27 11:44:19 pig-two -- MARK -- > >> > >> > >> > >> Analysis: it look like the network is failing, then everything - under a > >> second - re-connect, resync and work again. There are no impact on the > >> 'production'. Anyone got some kind of idea, why ? Is it an error in my > >> setup/design (see lower). > >> > >> > >> > >> > >> > >> Some background on the setup: > >> > >> > >> > >> It's an old version. Very old in fact - roadmap to upgrade has been > >> drafted and submitted to client - I`m just wondering about the specific > >> issue here... I want to be sure it's not an infrastructure design > problem. > >> > >> pig-two:~# cat /proc/drbd > >> > >> version: 8.2.6 (api:88/proto:86-88) > >> > >> GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by root at pig-two > , > >> 2008-08-19 15:02:28 > >> > >> 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r--- > >> > >> ns:650469968 nr:0 dw:648856776 dr:16725553 al:5463958 bm:22571 lo:0 > >> pe:0 ua:0 ap:0 oos:0 > >> > >> > >> > >> We are speaking, of: > >> > >> - 4x SAS 15k drives in a hardware raid-5 array (DELL > >> Perc5)... presented to the OS as /dev/sda. > >> > >> - /dev/sda is the back-end device for DRBD... presented to the OS as > >> /dev/drbd0 > >> > >> - /dev/drbd0 is a lone "physical volume" in a volume group (called > SAN) > >> from which Logical Volume are created. Those are NOT locally mounted. > >> > >> - those logical volumes are exported with vblade (AoE protocol, layer > >> 2) to some other physical system (Xen dom0) where they are used as > backend > >> device (/dev/etherd/e0.1) for root volume of virtual system > >> > >> > >> > >> Everything work fine, but when I do backup, I follow this process: > >> > >> - mount a CIFS exported share over the network > >> > >> - take a LV snapshot, mount it, and copy everything to the CIFS share. > >> > >> - unmount snapshot, delete it... do for all LV. > >> > >> - unmount network share > >> > >> > >> > >> The backup are consistent and valid (tested)... What have I missed ? > >> Should I move away from AoE to a Linux based iSCSI ? > >> > >> > >> > >> P. > >> > >> > >> > >> -- > >> > >> Pascal Charest - Cutting-edge technology consultant > >> https://www.labsphoenix.com > > > > > > -- > > -- > > Pascal Charest - Cutting-edge technology consultant > > Les Laboratoires Phoenix > > > > _______________________________________________ > > drbd-user mailing list > > drbd-user at lists.linbit.com > > http://lists.linbit.com/mailman/listinfo/drbd-user > > > > > -- -- Pascal Charest -* Cutting-edge technology consultant* Les Laboratoires Phoenix <https://labsphoenix.com> -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20110830/8bab77f7/attachment.htm>