[DRBD-user] VMware ESX 3 + iSCSI Enterprise Target/DRBD gone terribly wrong - help!

Lars Ellenberg lars.ellenberg at linbit.com
Wed Aug 1 22:18:02 CEST 2007

On Tue, Jul 03, 2007 at 12:38:23AM -0700, Jakobsen wrote:
> I have some critical issues with three ESX 3.0.1 servers, that access
> an iSCSI Enterprise Target. The iSCSI target is replicating with
> another server with DRBD, and everything works fine WITHOUT DRBD
> ENABLED.
> 
> When I enable DRBD, it starts a full sync that takes about 1 hour to
> complete, and everything seems fine. After the full sync, DRBD is not
> under heavy load anymore. Suddenly - without any errors on the DRBD
> servers - the VMware guests starts throwing I/O errors at me, and
> everything goes read-only.
> 
> Have any of you guys got the same problem?
> I have no clue what the problem can be.

meanwhile...  they hired me for consulting.

what we found is basically that drbd itself is not the cause:
adding drbd into the picture only shifts the timing, which makes
it more likely to _trigger_ the issue.

when you stress the vm clients (or the iscsi server, or both),
you will hit the very same problem anyways,
without drbd being involved.

it is actually a problem of combining (certain versions of) the linux guest
kernel scsi drivers (mptscsih) with the ESX initiator and some "hiccups"
(scsi timeouts) on the iscsi side. it is independent of whether you
use a software or hardware initiator, and of what sort of iSCSI target you
use (ietd-based linux stuff, an EMC SAN, or any other SAN box).
you only change the likelihood of triggering the problem.
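
to see from inside a guest whether it has already been hit, one can look
for filesystems that got remounted read-only. a minimal sketch, assuming a
linux guest with the usual /proc/mounts layout (device, mountpoint, fstype,
options, ...); the function name and the default list of filesystem types
are made up for illustration and are not taken from any of the referenced
threads:

```python
import os


def read_only_mounts(mounts_text, fstypes=("ext2", "ext3")):
    """Return (device, mountpoint) pairs of mounts whose options include 'ro'.

    mounts_text is the content of /proc/mounts; only filesystem types
    listed in fstypes are considered (hypothetical default, adjust as needed).
    """
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mountpoint, fstype, options = fields[:4]
        if fstype in fstypes and "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits


if __name__ == "__main__" and os.path.exists("/proc/mounts"):
    with open("/proc/mounts") as f:
        for device, mountpoint in read_only_mounts(f.read()):
            print("read-only: %s on %s" % (device, mountpoint))
```

this only reports the symptom; a filesystem that was mounted read-only on
purpose shows up the same way, so compare against what the guest is
supposed to look like.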

the issue (and workaround/fix) is well documented in at least these
forum threads, blogs, vmware advisory and redhat bugzillas:
[1] http://www.vmware.com/community/thread.jspa?threadID=58121&tstart=0
[2] http://www.vmware.com/community/thread.jspa?threadID=58081&tstart=0
[3] http://www.tuxyturvy.com/blog/index.php?/archives/31-VMware-ESX-and-ext3-journal-aborts.html
[4] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=197158
[5] https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=228108
[6] http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=51306

the solution is to change the guest linux kernel, or at least
patch its mptscsih driver module as explained in [3] and [1].
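
before patching anything it helps to know which guests use the mptscsih
driver at all. a rough sketch, assuming the driver is built as a module and
therefore listed in /proc/modules (a driver compiled into the kernel will
not appear there); the function name and the "mpt" prefix match are my own
choice for illustration, and the check says nothing about whether the
loaded driver is already patched:

```python
import os


def loaded_mpt_modules(modules_text, prefix="mpt"):
    """Return names of loaded modules whose name starts with prefix.

    modules_text is the content of /proc/modules, where the first
    whitespace-separated field of each line is the module name.
    """
    names = []
    for line in modules_text.splitlines():
        fields = line.split()
        if fields and fields[0].startswith(prefix):
            names.append(fields[0])
    return names


if __name__ == "__main__" and os.path.exists("/proc/modules"):
    with open("/proc/modules") as f:
        found = loaded_mpt_modules(f.read())
    if "mptscsih" in found:
        print("mptscsih is loaded; this guest may be affected")
    else:
        print("mptscsih not listed in /proc/modules")
```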

cheers,

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :


