[DRBD-user] Load high on primary node while doing backup on secondary

Lars Ellenberg lars.ellenberg at linbit.com
Wed Apr 30 22:31:17 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Wed, Apr 23, 2014 at 10:16:24AM -0700, Irwin Nemetz wrote:
> I have a two node cluster. There are 3 mail nodes running as KVM virtual
> machines on one node. The 3 VM's sit on top of a DRBD disk on a LVM volume
> which replicates to the passive 2nd node.
> 
> Hardware: 2x16 core AMD processors, 128gb memory, 5 3tb sas drives in a raid5

You likely won't get "terrific" performance out of a few large drives in RAID 5.

> The drbd replication is over a crossover cable.
> 
> version: 8.4.4 (api:1/proto:86-101)
> GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil at Build64R6, 2013-10-14 15:33:06

> resource zapp
> {
>   startup {
>     wfc-timeout 10;
>     outdated-wfc-timeout 10;
>     degr-wfc-timeout 10;
>   }
>   disk {
>     on-io-error detach; 
>     rate 40M;
>     al-extents 3389;
>   }
>   net {
>    verify-alg sha1;
>    max-buffers 8000;
>    max-epoch-size 8000;
>    sndbuf-size 512k;
>    cram-hmac-alg sha1;
>    shared-secret sync_disk;
>    data-integrity-alg sha1;

Don't enable data-integrity in production.  It will just burn your
cycles and limit your throughput to however fast your core can crunch sha1.

That's a *diagnostic* feature.
It does not really do anything for the integrity of your data,
it just happens to help to *detect* when integrity *may* be compromised.
We should have named that
"burn-cpu-cycles-and-calculate-extra-checksums-for-diagnostic-purposes".


>   }
>   on nodea.cluster.dns {
>    device /dev/drbd1;
>    disk /dev/virtimages/zapp;
>    address 10.88.88.171:7787;
>    meta-disk internal;
>   }
>   on nodeb.cluster.dns {
>    device /dev/drbd1;
>    disk /dev/virtimages/zapp;
>    address 10.88.88.172:7787;
>    meta-disk internal;
>   }
> }

You could probably do a lot of tuning ...
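
For example, things like the following are usually worth *benchmarking*
(only a sketch, not a drop-in recommendation; leave barriers/flushes
enabled unless your RAID controller has battery- or flash-backed write
cache):

    disk {
      disk-barrier no;   # ONLY with non-volatile write cache on the controller
      disk-flushes no;   # ONLY with non-volatile write cache on the controller
    }
    net {
      sndbuf-size 0;     # 0 lets the kernel auto-tune the send buffer
    }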

> I am trying to do a backup of the VM's nightly. They are about 2.7TB each.
> I create a snapshot on the backup node, mount it and then do a copy to a
> NAS backup storage device. The NAS is on its own network.
> 
> Here's the script:
> 
> [root at nodeb ~]# cat backup-zapp.sh
> #!/bin/bash
> 
> date
> cat > /etc/drbd.d/snap.res <<EOF

*OUCH*

Why would you do that?

There is no point in automatically creating a throwaway DRBD resource
just to access a snapshot taken from below another DRBD device.

Just use that snapshot directly.

Or am I missing something?

> /sbin/lvcreate -L500G -s -n snap-zapp /dev/virtimages/zapp
> 
> /sbin/drbdadm up snap

No need. Really.

> sleep 2
> /sbin/drbdadm primary snap
> mount -t ext4 /dev/drbd99 /mnt/zapp

Instead, this should do all you need:
mount -t ext4 -o ro /dev/virtimages/snap-zapp /mnt/zapp
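
With that, the whole backup run shrinks to roughly this
(an untested sketch reusing your paths; no snap.res, no drbdadm at all):

    lvcreate -L500G -s -n snap-zapp /dev/virtimages/zapp
    mount -t ext4 -o ro /dev/virtimages/snap-zapp /mnt/zapp
    cp -av /mnt/zapp/*.img /rackstation/images/
    umount /mnt/zapp
    lvremove -f /dev/virtimages/snap-zapp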

> cd /rackstation/images
> mv -vf zapp.img zapp.img.-1
> mv -vf zapp-opt.img zapp-opt.img.-1
> cp -av /mnt/zapp/*.img /rackstation/images
> umount /mnt/zapp
> /sbin/drbdadm down snap
> rm -f /etc/drbd.d/snap.res
> /sbin/lvremove -f /dev/virtimages/snap-zapp
> date
> 
> About half way thru the copy, the copy starts stuttering (network traffic
> stops and starts) and the load on the primary machine and the virtual
> machine being copied shoots thru the roof.

Probably the snapshot filling up, the disks being slow, and the fact that
everything written on the primary now has to be written on the secondary
*twice* (the old data is first copied into the snapshot's copy-on-write
store before the new data goes to the origin LV), while you are also
hammering that secondary IO subsystem with reads...
Of course that impacts the Primary, as soon as the secondary RAID 5
can no longer keep up with the load you throw at it.
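
You should be able to watch that happening on the secondary while the
copy runs, with something like:

    watch lvs virtimages     # Snap%/Data% column: how full the snapshot is
    iostat -x 5              # %util and await of the disks behind the RAID

The stuttering will most likely line up with the snapshot filling and
the disks sitting at 100% utilization.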

> I am at a loss to explain this since it's dealing with a snapshot of a
> volume on a replicated node. The only reasonable explanation I can think
> of is that the drbd replication is being blocked by something and this is
> causing the disk on the primary node to become unresponsive.

Yep.
See above.

You could try rsync --bwlimit instead of cp; that will reduce the
read load, but it will also prolong the lifetime of the snapshot,
so it may or may not actually help.
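
Roughly (the bandwidth number is only a placeholder to tune;
rsync takes it in KBytes per second):

    rsync -av --bwlimit=50000 /mnt/zapp/*.img /rackstation/images/

in place of the cp -av.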

Or maybe you just need to defrag your VM images...
possibly they are fragmented, and what you see is the secondary IO
subsystem hitting the IOPS limit while trying to seek through all the VM
image fragments...

Or use a "stacked" DBRD setup,
and disconnect the third node.
Or, if you can live with reduced redundancy during the backup,
disconnect the secondary for that time.
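
That is just (DRBD marks the blocks changed while disconnected in its
bitmap and resyncs only those when you reconnect):

    drbdadm disconnect zapp    # before creating the snapshot / starting the copy
    # ... run the backup ...
    drbdadm connect zapp       # afterwards; a short resync follows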

Or add a dedicated PV for the snapshot "exception store",
or add non-volatile cache to your RAID controller.
Or a number of other options.
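
For the dedicated PV variant: lvcreate lets you pin the snapshot to
specific PVs. For example (device name purely hypothetical; assumes you
first add a separate, ideally faster, disk to the virtimages VG):

    vgextend virtimages /dev/sdf1
    lvcreate -L500G -s -n snap-zapp /dev/virtimages/zapp /dev/sdf1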

Thing is, if you stress the secondary IO subsystem enough,
that *will* impact the (write performance on the) primary.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


