I have a two-node cluster. Three mail servers run as KVM virtual
machines on one node. The three VMs sit on top of a DRBD disk on an LVM
volume, which replicates to the passive second node.

Hardware: 2x 16-core AMD processors, 128 GB memory, five 3 TB SAS drives in a RAID 5.

The DRBD replication runs over a crossover cable.

version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 599f286440bd633d15d5ff985204aff4bccffadd build by phil@Build64R6, 2013-10-14 15:33:06

 1: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1029579824 dw:1029579824 dr:0 al:0 bm:176936 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 2: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1117874156 dw:1117874156 dr:0 al:0 bm:176928 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:1443855844 dw:1443855844 dr:0 al:0 bm:196602 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0

resource zapp
{
    startup {
        wfc-timeout 10;
        outdated-wfc-timeout 10;
        degr-wfc-timeout 10;
    }
    disk {
        on-io-error detach;
        rate 40M;
        al-extents 3389;
    }
    net {
        verify-alg sha1;
        max-buffers 8000;
        max-epoch-size 8000;
        sndbuf-size 512k;
        cram-hmac-alg sha1;
        shared-secret sync_disk;
        data-integrity-alg sha1;
    }
    on nodea.cluster.dns {
        device /dev/drbd1;
        disk /dev/virtimages/zapp;
        address 10.88.88.171:7787;
        meta-disk internal;
    }
    on nodeb.cluster.dns {
        device /dev/drbd1;
        disk /dev/virtimages/zapp;
        address 10.88.88.172:7787;
        meta-disk internal;
    }
}
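(A side note I spotted while pasting the config: if I am reading the 8.4
documentation right, "rate" is the old 8.3 keyword, and the 8.4 spelling
of this disk option is "resync-rate", i.e.:

    disk {
        on-io-error detach;
        resync-rate 40M;
        al-extents 3389;
    }

I have not verified whether drbdadm quietly accepts the old spelling, and
either way it should only throttle resync, not steady-state replication.)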
I am trying to back up the VMs nightly. They are about 2.7 TB each.
I create a snapshot on the backup node, mount it, and then copy it to a
NAS backup storage device. The NAS is on its own network.

Here's the script:

[root@nodeb ~]# cat backup-zapp.sh
#!/bin/bash

date

# Define a throwaway DRBD resource on top of the snapshot LV.
cat > /etc/drbd.d/snap.res <<EOF
resource snap
{
    on nodea.cluster.dns {
        device /dev/drbd99;
        disk /dev/virtimages/snap-zapp;
        address 10.88.88.171:7999;
        meta-disk internal;
    }
    on nodeb.cluster.dns {
        device /dev/drbd99;
        disk /dev/virtimages/snap-zapp;
        address 10.88.88.172:7999;
        meta-disk internal;
    }
}
EOF

# Snapshot the origin LV with 500 GB of copy-on-write room.
/sbin/lvcreate -L500G -s -n snap-zapp /dev/virtimages/zapp

# Bring the snapshot up as a DRBD device and mount it.
/sbin/drbdadm up snap
sleep 2
/sbin/drbdadm primary snap
mount -t ext4 /dev/drbd99 /mnt/zapp

# Rotate the previous images, then copy tonight's.
cd /rackstation/images
mv -vf zapp.img zapp.img.-1
mv -vf zapp-opt.img zapp-opt.img.-1
cp -av /mnt/zapp/*.img /rackstation/images

# Tear everything down again.
umount /mnt/zapp
/sbin/drbdadm down snap
rm -f /etc/drbd.d/snap.res
/sbin/lvremove -f /dev/virtimages/snap-zapp

date

About halfway through the copy, the copy starts stuttering (network
traffic stops and starts), and the load on the primary machine and on the
virtual machine being copied goes through the roof.

I am at a loss to explain this, since the copy only touches a snapshot of
a volume on the passive node. The only reasonable explanation I can think
of is that the DRBD replication is being blocked by something, and that
this is making the disk on the primary node unresponsive.

Irwin
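P.S. One theory I want to test: if I understand LVM snapshots correctly,
while snap-zapp exists, every replicated write landing on
/dev/virtimages/zapp first has to copy the old chunk into the snapshot's
CoW area, and since the resource runs protocol C, the primary waits for
that write. Next run I plan to log the snapshot fill and the DRBD
counters once a minute so I can line the numbers up with when the
stuttering starts. A rough sketch (the log path is just something I
picked):

#!/bin/bash
# monitor-snap.sh - run alongside backup-zapp.sh on nodeb.
# Logs the snapshot's Snap% (CoW area used) and the DRBD counters
# once a minute so they can be correlated with the stutter.
while true; do
    date
    /sbin/lvs /dev/virtimages/snap-zapp   # Snap% column = CoW area used
    cat /proc/drbd                        # watch dw/pe on device 1
    sleep 60
done >> /root/snap-monitor.log 2>&1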
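P.P.S. If the copy itself turns out to be starving the origin LV, the
obvious knob to try is throttling it, something like this in place of the
cp (untested, and the 50 MB/s figure is a guess):

/usr/bin/ionice -c3 rsync -av --bwlimit=51200 /mnt/zapp/*.img /rackstation/images/

As I understand it, ionice only helps under the CFQ scheduler, and
rsync's --bwlimit is in KB/s, so 51200 is roughly 50 MB/s.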