[DRBD-user] Effects of zeroing a DRBD device before use

Thu Oct 7 18:40:48 CEST 2010

For option 2) I am having the same problem, I roughly found it is
related to that I enabled the "Storage IO control" on the parent VMFS
store.  Which has a congestion control throttle from 30ms to 100ms,
that's not necessarily enough in this setup.

So when VM write to iscsi -> iscsi target writes to VMFS -> VMFS
writes get throttled -> iscsi target is staled -> VM hangs

and in my case, ESXi itself also locks up because of the iscsi IO
queue overflow, which is very bad :-(

So, disabling that storage IO control on that vmfs store where your
iscsi target host resist on, may help. also, move to a dedicated
machine and a backup plan is not that expensive but more reliable.

On Thu, Oct 7, 2010 at 7:43 AM, Patrick Zwahlen <paz at navixia.com> wrote:
> Hi all,
>
> I'm looking for some inputs from the experts !
>
> Short story
> -----------
> zeroing my DRBD device before using it turns a non-working system into a
> working one, and I'm trying to figure out why. I'm also trying to
> understand if I will have other problems down the road.
>
> Long story
> ----------
> I am building a pair of redundant iSCSI targets for VMware ESX4.1, using
> the following software components:
> - Fedora 12 x86_64
> - DRBD 8.3.8.1
> - pacemaker 1.0.9
> - corosync 1.2.8
> - SCST iSCSI Target (using SVN trunk, almost 2.0)
>
> SCST isn't cluster aware, so I'm using DRBD in primary/secondary mode.
> I'm creating two iSCSI targets, one on each node, with mutual failover
> and no multipath. As a reference for the discussion, I'm attaching my
> resource agent, my CIB and my DRBD config files. The resource agent is a
> modification of iSCSITarget/iSCSILun with some SCST specifics.
>
> When running this setup on a pair of physical hosts, everything works
> fine. However, my interest is in small setups and I want to run the two
> targets in VMs, hosted on the ESX hosts that will be the iSCSI
> initiators. The market calls this a virtual SAN... I know, I know, this
> is not recommended, but it definitely exists as commercial solutions,
> and makes a lot of sense for small setups. I'm not looking for perf, but
> for high-availability.
>
> This being said, I have two ways to present disk space (physical) to
> DRBD (they are /dev/sdb and /dev/sdc in the VMs):
>
> 1) Map raid volumes to the Fedora VMs using RDM (Raw Device Mapping)
> 2) Format the raid volumes with VMFS, and create virtual disks (VMDKs)
> in that datastore for the Fedora VMs.
>
> Option 1) obviously works better, but is not always possible (many
> restrictions on RAID controllers, for instance).
>
> Option 2) works fine until I put iSCSI WRITE load on my Fedora VM. When
> using large blocks, I quickly end up with stale VMs. The iSCSI target
> complains that the backend device doesn't respond, and the kernel gives
> me 120 seconds timeouts for the DRBD threads. The DRBD backend devices
> appear dead. At this stage, there is no iSCSI traffic anymore, CPU usage
> in null, memory is fine, starvation.... Rebooting the Fedora VM solves
> the problem. Seen from a DRBD/SCST point of view, it's as if backend
> hardware was failing. However, physical disks/arrays are fine. The
> problem is clearly within VMware.
>
> One of the VMware recommendation is to create the large VMDKs in
> 'eagerZeroedThick', which basically zeroes everything before use. This
> helps, but doesn't solve the problem completely.
>
> I then tried a third option: format /dev/drbd0 with XFS, create one BIG
> file (using dd) on that filesystem, and export this file via iSCSI/SCST
> (instead of exporting the /dev/drbd0 block device directly). I couldn't
> crash this setup, but I don't like the idea of having a single 200G file
> on a 99% full filesystem.
>
> This brought me to option 4: I directly export /dev/drbd0 via SCST (same
> as option 1 and 2), but before using it, I issue a:
>
>        dd if=/dev/zero of=/dev/drbd0 bs=4096
>
> I'm now running this setup since 2 weeks, trying to put as much load as
> I can on it (mainly using dd, bonnie++, DiskTT and running VMware
> Storage vMotion). The only issue I have faced is that sometimes the
> pacemaker 'monitor' action takes more than 20 seconds to run on DRBD, so
> I have increased this timeout to 60s. Since then, no problem at all!
>
> As you can imagine, I'm pretty happy with the setup, but I still don't
> fully understand why it now works. I hate these situations...
>
> Can zeroing make such a big difference ? Does it just make a difference
> at the RAID/disk level, or does it also make a difference at the DRBD
> level ?
>
> Sorry for the long e-mail, and thanks a ton for any input. - Patrick -
>
> PS: Based on my reading, many people are trying to implement such
> solutions. XtraVirt had a VM at some point, but not anymore. People are
> trying to do it with OpenFiler, but IET and VMware don't like each
> other. My setup is not documented the way it should, but I'm ready to
> share if anyone wants to play with it.
>
>
> **************************************************************************************
> This email and any files transmitted with it are confidential and
> intended solely for the use of the individual or entity to whom they
> are addressed. If you have received this email in error please notify
> the system manager. postmaster at navixia.com
> **************************************************************************************
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>
>