Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
For option 2) I am having the same problem. I roughly traced it back to having
"Storage IO Control" enabled on the parent VMFS datastore. Its congestion
control throttles at a latency threshold somewhere between 30ms and 100ms,
which is not necessarily enough headroom in this setup. So the chain is:
VM writes to iSCSI -> iSCSI target writes to VMFS -> VMFS writes get throttled
-> iSCSI target stalls -> VM hangs. In my case ESXi itself also locked up
because the iSCSI IO queue overflowed, which is very bad :-(

So, disabling Storage IO Control on the VMFS datastore that hosts your iSCSI
target VM may help. Also, moving to a dedicated machine plus a backup plan is
not that expensive, and it is more reliable.
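
A related mitigation on the guest side, separate from Storage IO Control
itself: raising the per-disk SCSI timeout inside the target VM, so that a
briefly throttled datastore shows up as slow I/O rather than as a dead backing
device underneath DRBD. Only a sketch, assuming the DRBD backing disks are
/dev/sdb and /dev/sdc as in the mail below; the 180-second value is an
arbitrary choice:

  # Show the current SCSI timeout (in seconds) for the disks backing DRBD.
  # The kernel default is usually 30s; VMware Tools often raises it for
  # virtual disks.
  cat /sys/block/sdb/device/timeout /sys/block/sdc/device/timeout

  # Raise it so a momentarily throttled VMFS datastore stalls I/O instead of
  # triggering SCSI error handling (and an apparently dead backend) under DRBD.
  echo 180 > /sys/block/sdb/device/timeout
  echo 180 > /sys/block/sdc/device/timeout

The setting does not survive a reboot unless it is reapplied, e.g. from a
udev rule.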

On Thu, Oct 7, 2010 at 7:43 AM, Patrick Zwahlen <paz at navixia.com> wrote:
> Hi all,
>
> I'm looking for some input from the experts!
>
> Short story
> -----------
> Zeroing my DRBD device before using it turns a non-working system into a
> working one, and I'm trying to figure out why. I'm also trying to
> understand if I will have other problems down the road.
>
> Long story
> ----------
> I am building a pair of redundant iSCSI targets for VMware ESX 4.1, using
> the following software components:
> - Fedora 12 x86_64
> - DRBD 8.3.8.1
> - pacemaker 1.0.9
> - corosync 1.2.8
> - SCST iSCSI target (SVN trunk, almost 2.0)
>
> SCST isn't cluster aware, so I'm using DRBD in primary/secondary mode.
> I'm creating two iSCSI targets, one on each node, with mutual failover
> and no multipath. As a reference for the discussion, I'm attaching my
> resource agent, my CIB and my DRBD config files. The resource agent is a
> modification of iSCSITarget/iSCSILun with some SCST specifics.
>
> When running this setup on a pair of physical hosts, everything works
> fine. However, my interest is in small setups and I want to run the two
> targets in VMs, hosted on the ESX hosts that will be the iSCSI
> initiators. The market calls this a virtual SAN... I know, I know, this
> is not recommended, but it definitely exists as a commercial solution,
> and it makes a lot of sense for small setups. I'm not looking for
> performance, but for high availability.
>
> This being said, I have two ways to present (physical) disk space to
> DRBD (it shows up as /dev/sdb and /dev/sdc in the VMs):
>
> 1) Map RAID volumes to the Fedora VMs using RDM (Raw Device Mapping).
> 2) Format the RAID volumes with VMFS, and create virtual disks (VMDKs)
> in that datastore for the Fedora VMs.
>
> Option 1) obviously works better, but is not always possible (many
> restrictions on RAID controllers, for instance).
>
> Option 2) works fine until I put iSCSI WRITE load on my Fedora VM. When
> using large blocks, I quickly end up with stale VMs. The iSCSI target
> complains that the backend device doesn't respond, and the kernel gives
> me 120-second timeouts for the DRBD threads. The DRBD backend devices
> appear dead. At this stage, there is no iSCSI traffic anymore, CPU usage
> is null, memory is fine, starvation... Rebooting the Fedora VM solves
> the problem. Seen from a DRBD/SCST point of view, it's as if the backend
> hardware were failing. However, the physical disks/arrays are fine. The
> problem is clearly within VMware.
>
> One of the VMware recommendations is to create the large VMDKs as
> 'eagerZeroedThick', which basically zeroes everything before use. This
> helps, but doesn't solve the problem completely.
>
> I then tried a third option: format /dev/drbd0 with XFS, create one BIG
> file (using dd) on that filesystem, and export this file via iSCSI/SCST
> (instead of exporting the /dev/drbd0 block device directly). I couldn't
> crash this setup, but I don't like the idea of having a single 200G file
> on a 99% full filesystem.
>
> This brought me to option 4: I directly export /dev/drbd0 via SCST (same
> as options 1 and 2), but before using it, I issue a:
>
> dd if=/dev/zero of=/dev/drbd0 bs=4096
>
> I have now been running this setup for 2 weeks, trying to put as much
> load on it as I can (mainly using dd, bonnie++, DiskTT and VMware
> Storage vMotion). The only issue I have faced is that sometimes the
> pacemaker 'monitor' action takes more than 20 seconds to run on DRBD, so
> I have increased that timeout to 60s. Since then, no problem at all!
>
> As you can imagine, I'm pretty happy with the setup, but I still don't
> fully understand why it now works. I hate these situations...
>
> Can zeroing make such a big difference? Does it just make a difference
> at the RAID/disk level, or does it also make a difference at the DRBD
> level?
>
> Sorry for the long e-mail, and thanks a ton for any input. - Patrick -
>
> PS: Based on my reading, many people are trying to implement such
> solutions. XtraVirt had a VM at some point, but not anymore. People are
> trying to do it with OpenFiler, but IET and VMware don't like each
> other. My setup is not documented the way it should be, but I'm ready to
> share it if anyone wants to play with it.
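
As for why the zeroing in option 4 helps: on lazily zeroed ('zeroedthick')
VMDKs, VMFS still has to zero each block the first time it is written, so that
first-write penalty lands in the middle of iSCSI load unless something has
already touched every block. Pre-zeroing the whole device from the guest
forces that work to happen up front, which is presumably also part of why
eagerZeroedThick helps (though it doesn't explain why eagerZeroedThick alone
wasn't enough here). A slightly faster variant of the same dd, purely as a
sketch with an arbitrary block size (the device name is taken from the mail
above):

  # Touch every block of the DRBD device once, so the underlying VMDK has
  # already taken the first-write hit before production iSCSI traffic arrives.
  # dd ends with "No space left on device" when it reaches the end of the
  # device; that is expected.
  dd if=/dev/zero of=/dev/drbd0 bs=1M oflag=direct

  # Sanity check: total size in bytes of the device that was just written.
  blockdev --getsize64 /dev/drbd0

Run on the primary with the resource connected, this also pushes a full pass
of writes through DRBD replication, so the secondary's VMDK gets the same
treatment.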