Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wed, May 30, 2012 at 05:13:57PM +0200, Wiebe Cazemier wrote:
> Hi,
>
> I'm testing setting up a Xen DomU with DRBD storage for easy
> failover. Most of the time, immediately after booting the DomU, I get
> an IO error:
>
> [    3.153370] EXT3-fs (xvda2): using internal journal
> [    3.277115] ip_tables: (C) 2000-2006 Netfilter Core Team
> [    3.336014] nf_conntrack version 0.5.0 (3899 buckets, 15596 max)
> [    3.515604] init: failsafe main process (397) killed by TERM signal
> [    3.801589] blkfront: barrier: write xvda2 op failed
> [    3.801597] blkfront: xvda2: barrier or flush: disabled
> [    3.801611] end_request: I/O error, dev xvda2, sector 52171168
> [    3.801630] end_request: I/O error, dev xvda2, sector 52171168
> [    3.801642] Buffer I/O error on device xvda2, logical block 6521396

iirc, that's a problem with xen blktap (or whatever it is called),
and supposedly fixed in later versions of that driver. Or maybe it
resurfaced as a regression again. Workaround may be to use "phy".
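Untested sketch, and I am guessing at your DomU config file name and
its current disk line (neither is in your mail), but switching the
backend to "phy" would look roughly like this in /etc/xen/drbdvm.cfg:

    # assumed current line, going through blktap:
    #disk = [ 'tap:aio:/dev/drbd3,xvda2,w' ]
    # "phy" hands the block device to blkback directly, bypassing blktap:
    disk = [ 'phy:/dev/drbd3,xvda2,w' ]

If the DomU sees the whole disk rather than one partition, the vdev
would be xvda instead of xvda2.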
> [    3.801652] lost page write due to I/O error on xvda2
> [    3.801755] Aborting journal on device xvda2.

xvda2 likely has nothing to do with your DRBD setup; I assume that
this is part of your system disk (maybe the root partition?).
As soon as DRBD is used at all, it is made responsible for
everything :-/  At least in this case, it is not :-)

> [    3.804415] EXT3-fs (xvda2): error: ext3_journal_start_sb: Detected aborted journal
> [    3.804434] EXT3-fs (xvda2): error: remounting filesystem read-only
> [    3.814754] journal commit I/O error
> [    6.973831] init: udev-fallback-graphics main process (538) terminated with status 1
> [    6.992267] init: plymouth-splash main process (546) terminated with status 1
>
> The manpage of drbdsetup says that LVM (which I use) doesn't support
> barriers (better known as "tagged command queuing" or "native command
> queuing"), so I configured the DRBD device not to use barriers. This
> can be seen in /proc/drbd ("wo:f", meaning flush, the next method
> DRBD chooses after barrier):
>
> 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----
>    ns:2160152 nr:520204 dw:2680344 dr:2678107 al:3549 bm:9183 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> And on the other host:
>
> 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----
>    ns:0 nr:2160152 dw:2160152 dr:0 al:0 bm:8052 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
>
> I also enabled the option disable_sendpage, as per the DRBD docs:
>
> cat /sys/module/drbd/parameters/disable_sendpage
> Y
>
> I also tried adding barriers=0 to fstab as a mount option. Still it says:
>
> [   58.603896] blkfront: barrier: write xvda2 op failed
> [   58.603903] blkfront: xvda2: barrier or flush: disabled
>
> I don't even know if ext3 has a nobarrier option, but it does seem to
> work. But, because only one of my storage systems is battery backed,
> it would not be smart.
>
> Why does it still complain about barriers when I disabled that?
>
> Both hosts are:
>
> Debian: 6.0.4
> uname -a: Linux 2.6.32-5-xen-amd64
> drbd: 8.3.7
> Xen: 4.0.1
>
> Guest:
>
> Ubuntu 12.04 LTS
> uname -a: Linux 3.2.0-24-generic pvops
>
> drbd resource:
>
> resource drbdvm
> {
>     meta-disk internal;
>     device /dev/drbd3;
>
>     startup
>     {
>         # The timeout value when the last known state of the other
>         # side was available. 0 means infinite.
>         wfc-timeout 0;
>
>         # Timeout value when the last known state was disconnected.
>         # 0 means infinite.
>         degr-wfc-timeout 180;
>     }
>
>     syncer
>     {
>         # This is recommended only for low-bandwidth lines, to only
>         # send those blocks which really have changed.
>         #csums-alg md5;
>
>         # Set to about half your net speed.
>         rate 60M;
>
>         # This option moved to the 'net' section in drbd 8.4 (a later
>         # release than Debian currently has).
>         verify-alg md5;
>     }
>
>     net
>     {
>         # The manpage says this is recommended only in pre-production
>         # (because of its performance cost), to determine if your LAN
>         # card has a TCP checksum offloading bug.
>         #data-integrity-alg md5;
>     }
>
>     disk
>     {
>         # Detach causes the device to work over-the-network-only
>         # after the underlying disk fails. Detach is not the default
>         # for historical reasons, but is recommended by the docs.
>         # However, the Debian defaults in drbd.conf suggest the
>         # machine will reboot in that event...
>         on-io-error detach;
>
>         # LVM doesn't support barriers, so disable them; DRBD falls
>         # back to flush. Check wo: in /proc/drbd. If you don't
>         # disable them, you get IO errors.
>         no-disk-barrier;
>     }
>
>     on host1
>     {
>         # universe is a VG
>         disk /dev/universe/drbdvm-disk;
>         address 10.0.0.1:7792;
>     }
>
>     on host2
>     {
>         # universe is a VG
>         disk /dev/universe/drbdvm-disk;
>         address 10.0.0.2:7792;
>     }
> }
>
> In my test setup, the primary host's storage is a 9650SE SATA-II
> PCIe RAID controller with battery backup; the secondary is software
> RAID1.
>
> Isn't DRBD+Xen widely used? With these problems, it's not going to
> work.
>
> Any help welcome.
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list -- I'm subscribed