[DRBD-user] Effects of zeroing a DRBD device before use

Thu Oct 7 13:43:45 CEST 2010

Hi all,

I'm looking for some input from the experts!

Short story
-----------
Zeroing my DRBD device before using it turns a non-working system into a
working one, and I'm trying to figure out why. I'm also trying to
understand whether I will have other problems down the road.

Long story
----------
I am building a pair of redundant iSCSI targets for VMware ESX4.1, using
the following software components:
- Fedora 12 x86_64
- DRBD 8.3.8.1
- pacemaker 1.0.9
- corosync 1.2.8
- SCST iSCSI Target (using SVN trunk, almost 2.0)

SCST isn't cluster aware, so I'm using DRBD in primary/secondary mode.
I'm creating two iSCSI targets, one on each node, with mutual failover
and no multipath. As a reference for the discussion, I'm attaching my
resource agent, my CIB and my DRBD config files. The resource agent is a
modification of iSCSITarget/iSCSILun with some SCST specifics.

When running this setup on a pair of physical hosts, everything works
fine. However, my interest is in small setups and I want to run the two
targets in VMs, hosted on the ESX hosts that will be the iSCSI
initiators. The market calls this a virtual SAN... I know, I know, this
is not recommended, but commercial solutions of this kind definitely
exist, and it makes a lot of sense for small setups. I'm not looking for
performance, but for high availability.

This being said, I have two ways to present disk space (physical) to
DRBD (they are /dev/sdb and /dev/sdc in the VMs):

1) Map RAID volumes to the Fedora VMs using RDM (Raw Device Mapping)
2) Format the RAID volumes with VMFS, and create virtual disks (VMDKs)
in that datastore for the Fedora VMs.

Option 1) obviously works better, but is not always possible (many
restrictions on RAID controllers, for instance).

Option 2) works fine until I put iSCSI WRITE load on my Fedora VM. When
using large blocks, I quickly end up with stalled VMs. The iSCSI target
complains that the backend device doesn't respond, and the kernel reports
120-second timeouts for the DRBD threads. The DRBD backend devices
appear dead. At this stage, there is no iSCSI traffic anymore, CPU usage
is nil, and memory is fine: it looks like starvation. Rebooting the
Fedora VM solves the problem. Seen from a DRBD/SCST point of view, it's
as if the backend hardware were failing. However, the physical
disks/arrays are fine. The problem is clearly within VMware.
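
For anyone wanting to check for the same symptom: assuming the 120-second
timeouts are the kernel's hung-task warnings ("blocked for more than 120
seconds"), a small filter along these lines could list which tasks were
reported. The function name hung_tasks is made up for this sketch, and the
pattern assumes the usual wording of that warning.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the unique
# names of tasks that the kernel reported as blocked (hung-task warnings).
hung_tasks() {
    # Keep only the task name that precedes ":<pid> blocked for more than N seconds"
    sed -n 's/.*task \(.*\):[0-9]* blocked for more than [0-9]* seconds.*/\1/p' | sort -u
}

# Assumed usage inside the affected VM:
#   dmesg | hung_tasks
```

If DRBD threads show up in that list, the stall is at least visible at the
block layer inside the VM, which matches the description above.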

One of VMware's recommendations is to create the large VMDKs as
'eagerZeroedThick', which basically zeroes everything before use. This
helps, but doesn't solve the problem completely.

I then tried a third option: format /dev/drbd0 with XFS, create one BIG
file (using dd) on that filesystem, and export this file via iSCSI/SCST
(instead of exporting the /dev/drbd0 block device directly). I couldn't
crash this setup, but I don't like the idea of having a single 200G file
on a 99% full filesystem.
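
A rough sketch of what this third option could look like is given below.
The mount point /mnt/drbd0, the file name lun0.img and the size are
placeholders rather than the values actually used, and the helper
make_backing_file is hypothetical. The SCST configuration that exports the
file is not shown.

```shell
#!/bin/sh
# Hypothetical helper: create a backing file of SIZE_MB megabytes filled
# with real zeros, so every block of the file is written once (not sparse).
make_backing_file() {
    file=$1
    size_mb=$2
    # conv=fsync makes dd flush the data to disk before it exits (GNU dd)
    dd if=/dev/zero of="$file" bs=1M count="$size_mb" conv=fsync 2>/dev/null
}

# Assumed usage on the DRBD primary (paths and size are placeholders):
#   mkfs.xfs /dev/drbd0
#   mount /dev/drbd0 /mnt/drbd0
#   make_backing_file /mnt/drbd0/lun0.img 1024
```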

This brought me to option 4: I directly export /dev/drbd0 via SCST (same
as options 1 and 2), but before using it, I run:

	dd if=/dev/zero of=/dev/drbd0 bs=4096
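
A slightly more defensive variant of that zeroing step might look like the
sketch below. It is only an illustration: zero_and_verify is a made-up
helper, the 1M block size is an arbitrary choice, and the read-back check
is an extra that the command above does not perform.

```shell
#!/bin/sh
# Hypothetical helper: write SIZE_MB megabytes of zeros to TARGET, flush
# them to disk, then read the same range back and confirm it is all zeros.
zero_and_verify() {
    target=$1
    size_mb=$2
    dd if=/dev/zero of="$target" bs=1M count="$size_mb" conv=fsync 2>/dev/null || return 1
    # cmp -n limits the comparison to the number of bytes just written
    cmp -s -n "$((size_mb * 1024 * 1024))" "$target" /dev/zero
}

# Assumed usage on the DRBD primary (destroys all data on the device).
# Any trailing part of the device smaller than 1 MiB is left unwritten:
#   zero_and_verify /dev/drbd0 "$(( $(blockdev --getsize64 /dev/drbd0) / 1048576 ))"
```

Giving dd an explicit count also avoids the "No space left on device"
error that dd reports when it simply runs into the end of the device.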

I have now been running this setup for 2 weeks, putting as much load as
I can on it (mainly using dd, bonnie++, DiskTT and running VMware
Storage vMotion). The only issue I have faced is that sometimes the
pacemaker 'monitor' action takes more than 20 seconds to run on DRBD, so
I have increased this timeout to 60s. Since then, no problem at all!

As you can imagine, I'm pretty happy with the setup, but I still don't
fully understand why it now works. I hate these situations...

Can zeroing make such a big difference? Does it just make a difference
at the RAID/disk level, or does it also make a difference at the DRBD
level?

Sorry for the long e-mail, and thanks a ton for any input.

- Patrick -

PS: Based on my reading, many people are trying to implement such
solutions. XtraVirt had a VM at some point, but not anymore. People are
trying to do it with OpenFiler, but IET and VMware don't like each
other. My setup is not documented the way it should be, but I'm ready to
share if anyone wants to play with it.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: cib.txt
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101007/1896c9c1/attachment.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: scstlun
Type: application/octet-stream
Size: 9175 bytes
Desc: scstlun
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101007/1896c9c1/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_a.res
Type: application/octet-stream
Size: 223 bytes
Desc: drbd_a.res
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101007/1896c9c1/attachment-0001.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_b.res
Type: application/octet-stream
Size: 223 bytes
Desc: drbd_b.res
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101007/1896c9c1/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: global_common.conf
Type: application/octet-stream
Size: 407 bytes
Desc: global_common.conf
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20101007/1896c9c1/attachment-0003.obj>