[DRBD-user] Problem (Total-Crash) with XFS on LVM2 on DRBD. Reason or Solution?

Mon Jul 20 14:18:35 CEST 2009

Hi,

I run a Linux file server on Debian 5.0 Lenny.
There is a RAID-1 device for the system partitions and a RAID-5 device for
the data partitions.

DRBD sits on top of the RAID-5 device and provides /dev/drbd0 as the physical
volume for LVM.
LVM provides about 25 logical volumes, each formatted with the XFS
filesystem.

Everything went fine during about 2 weeks of testing, but a few hours after
the server went into production, the filesystem on the first
logical volume crashed.
After a complete system reboot, LVM can't find ANY physical volumes
(i.e. drbd0), as tested with pvscan.

The kernel says:

Jul 19 13:36:03 superserver kernel: [115605.913825] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jul 19 13:36:03 superserver kernel: [115605.913825] Filesystem "dm-0": XFS internal error xfs_da_do_buf(2) at line 2085 of file fs/xfs/xfs_da_btree.c.  Caller 0xfc9fcc69
Jul 19 13:36:03 superserver kernel: [115605.913825] Pid: 10423, comm: smbd Not tainted 2.6.26-2-686 #1
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9fcb3e>] xfs_da_do_buf+0x5ef/0x6bd [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9fcc69>] xfs_da_read_buf+0x19/0x1e [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9fcc69>] xfs_da_read_buf+0x19/0x1e [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<f89c176b>] do_get_write_access+0x2f8/0x331 [jbd]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9fcc69>] xfs_da_read_buf+0x19/0x1e [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9ff20a>] xfs_dir2_block_lookup_int+0x3d/0x177 [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9ff20a>] xfs_dir2_block_lookup_int+0x3d/0x177 [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c0129b12>] lock_timer_base+0x19/0x35
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9ffa8e>] xfs_dir2_block_lookup+0x16/0x96 [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9fe6c5>] xfs_dir2_isblock+0x14/0x58 [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fc9ff02d>] xfs_dir_lookup+0x98/0xe7 [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c02865ed>] __tcp_push_pending_frames+0x619/0x6b2
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fca24a6c>] xfs_lookup+0x40/0x8a [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<fca2f2b6>] xfs_vn_lookup+0x35/0x6b [xfs]
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c017a607>] do_lookup+0xb6/0x153
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c017c1fb>] __link_path_walk+0x726/0xb0d
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c015896e>] filemap_fault+0x1f5/0x35b
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c017c619>] path_walk+0x37/0x70
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c017c8c8>] do_path_lookup+0x122/0x184
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c017d125>] __user_walk_fd+0x29/0x3a
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c01772c6>] vfs_stat_fd+0x15/0x3c
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c01773a2>] sys_stat64+0xf/0x23
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c0115b67>] do_page_fault+0x29b/0x5b8
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c0174ea3>] sys_read+0x3c/0x63
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c01158cc>] do_page_fault+0x0/0x5b8
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c0103853>] sysenter_past_esp+0x78/0xb1
Jul 19 13:36:03 superserver kernel: [115605.913825]  [<c02b0000>] aer_probe+0x7d/0xff
Jul 19 13:36:03 superserver kernel: [115605.913825]  =======================

This log block repeated nearly every second until the complete
system reboot.
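
To put a number on "nearly every second", the error lines can be counted per timestamp. The following is only a minimal sketch: it assumes the log lines are unwrapped, as they would appear in the kernel log file rather than in this mail, and the names XFS_ERROR and count_xfs_errors are made up for illustration.

```python
import re
from collections import Counter

# Matches an unwrapped XFS "internal error" line from the kernel log and
# captures the syslog timestamp, the device name and the reporting function.
XFS_ERROR = re.compile(
    r'^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s.*'
    r'Filesystem "([^"]+)": XFS internal error (\S+) at line (\d+)'
)


def count_xfs_errors(lines):
    """Count XFS internal errors keyed by (timestamp, device, function)."""
    counts = Counter()
    for line in lines:
        match = XFS_ERROR.search(line)
        if match:
            counts[(match.group(1), match.group(2), match.group(3))] += 1
    return counts


# Two lines taken from the trace above, joined back into single lines.
sample = [
    'Jul 19 13:36:03 superserver kernel: [115605.913825] Filesystem "dm-0": '
    'XFS internal error xfs_da_do_buf(2) at line 2085 of file '
    'fs/xfs/xfs_da_btree.c.  Caller 0xfc9fcc69',
    'Jul 19 13:36:03 superserver kernel: [115605.913825] Pid: 10423, '
    'comm: smbd Not tainted 2.6.26-2-686 #1',
]

for key, number in count_xfs_errors(sample).items():
    print(key, number)
```

Feeding it the lines of the real log file instead of the sample would show whether the error always comes from the same device and function.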

Currently no data-partition is available.

Does anyone know the reason, or better, have a solution for this problem?
Doesn't XFS work with DRBD using protocol A? Did XFS take down the whole LVM
AND DRBD device? Could a kernel bug be the reason?
I'm about to reconfigure DRBD and LVM, but I'm scared of having such a
crash again...

I would be happy to get an answer. Thanks a lot!

Here is my drbd.conf:

global {
  usage-count no;
}

resource r0 {
  protocol A;

  startup {
    wfc-timeout        120;  ## 2 minutes.
    degr-wfc-timeout  120;  ## 2 minutes.
  }

  disk {
    on-io-error detach;
  }

  net {
    max-buffers 8192;
    unplug-watermark 8192;
  }

  syncer {
    rate 100M;
    al-extents 3833;
  }

  on server {
    device     /dev/drbd0;
    disk       /dev/sdb;
    address    10.0.0.11:7788;
    meta-disk  internal;
  }

  on backupserver {
    device     /dev/drbd0;
    disk       /dev/sdb;
    address    10.0.0.12:7788;
    meta-disk  internal;
  }
}
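
One sanity check that may be worth doing on a config like this: drbdadm picks the "on" section whose name matches the output of `uname -n` on each node. The sketch below lists the "on" host names in a config text and reports which node names have no matching section. It is a simplified illustration, not a real drbd.conf parser. The helper names are invented, and "superserver" is taken from the kernel log above on the assumption that it is the real node name (the "on server" in the posted config may simply have been shortened).

```python
import re

# Matches the opening of a host section such as "on server {".
# "on-io-error detach;" does not match, since "on" must be followed by whitespace.
ON_SECTION = re.compile(r'^\s*on\s+(\S+)\s*\{', re.MULTILINE)


def on_section_hosts(conf_text):
    """Return the host names used in `on <host> {` sections, in file order."""
    return ON_SECTION.findall(conf_text)


def hosts_missing_section(conf_text, node_names):
    """Return the node names (as printed by `uname -n`) without an `on` section."""
    declared = set(on_section_hosts(conf_text))
    return [name for name in node_names if name not in declared]


# Shortened excerpt of the config above.
conf = """
resource r0 {
  disk {
    on-io-error detach;
  }
  on server {
    device     /dev/drbd0;
  }
  on backupserver {
    device     /dev/drbd0;
  }
}
"""

print(on_section_hosts(conf))
print(hosts_missing_section(conf, ["superserver", "backupserver"]))
```

If the second list is not empty on the real nodes, DRBD would not come up there with this config, which would also leave LVM without /dev/drbd0 to scan.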