Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm really getting desperate on this one, since our server is currently
not in a high-availability state, so I thought I'd include some more
info. Attached is my drbd.conf. I am running RHEL5 (kernel
2.6.18-8.1.14.el5) on both systems. Below is a capture from my system
messages log from the original failure:
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300760, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300800, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300864, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300928, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300992, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234301016, limit=234300736
Feb 7 05:41:50 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:50 arc-swilliamslx kernel: drbd0: rw=0, want=234300744, limit=234300736
Feb 7 05:41:57 arc-swilliamslx kernel: attempt to access beyond end of device
Feb 7 05:41:57 arc-swilliamslx kernel: drbd0: rw=0, want=234303728, limit=234300736
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0): ext3_free_branches: Read failure, inode=14209948, block=29287965
Feb 7 05:41:57 arc-swilliamslx kernel: Aborting journal on device drbd0.
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_reserve_inode_write: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_truncate: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_reserve_inode_write: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_orphan_del: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_reserve_inode_write: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0) in ext3_delete_inode: Journal has aborted
Feb 7 05:41:57 arc-swilliamslx kernel: __journal_remove_journal_head: freeing b_committed_data
Feb 7 05:41:57 arc-swilliamslx kernel: ext3_abort called.
Feb 7 05:41:57 arc-swilliamslx kernel: EXT3-fs error (device drbd0): ext3_journal_start_sb: Detected aborted journal
Feb 7 05:41:57 arc-swilliamslx kernel: Remounting filesystem read-only
(I believe this is where heartbeat stepped in and failed over to the
other server; postgresql was down because of the read-only mount.)
Feb 7 05:42:32 arc-swilliamslx kernel: __journal_remove_journal_head: freeing b_committed_data
Feb 7 05:42:32 arc-swilliamslx kernel: drbd0: role( Primary -> Secondary )
Feb 7 05:42:32 arc-swilliamslx kernel: drbd0: Writing meta data super block now.
Feb 7 05:42:34 arc-swilliamslx kernel: drbd0: peer( Secondary -> Primary )
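
If I'm reading those numbers right (assuming the kernel's want/limit
values are in 512-byte sectors and the ext3 block size is 4 KiB, which I
haven't double-checked), they line up with the fsck output in my earlier
message quoted below:

    limit:      234300736 sectors / 8 = 29287592 4K blocks   (fsck's "physical size of the device")
    superblock: 29288495 blocks * 8   = 234307960 sectors    (past the 234300736-sector limit)

In other words, the filesystem thinks it is about 903 blocks (roughly
3.5 MiB) bigger than the device it sits on. (See also the quick size
checks I pasted after the quoted message.)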
Any help will be greatly appreciated,
Doug
On Thu, 2008-02-07 at 14:04 -0500, Doug Knight wrote:
> Hi list,
> I had one of my HA systems, running drbd 8.0.1, issue an error on its
> drbd0 device (see title). We recently resized the underlying partition
> using gparted to include the partition immediately following it
> (verified that the new, larger partitions were identical, and ran the
> meta-data fix command that drbd suggested when it was restarted). We
> did this on both systems, and everything seemed OK for a few days.
> This morning we got the error, heartbeat detected it, and migrated
> resources to the other system with no problem. I took drbd down on
> both systems, brought up and set primary drbd0 on the system with the
> issue, and ran fsck -fvn /dev/drbd0 against it (unmounted). I got the
> following:
>
> The filesystem size (according to the superblock) is 29288495 blocks
> The physical size of the device is 29287592 blocks
> Either the superblock or the partition table is likely to be corrupt!
>
> So I then ran fsck without the -n to correct the errors. Now drbd
> seems to be completely hosed up. If I do a ./drbd start, the system
> locks up. If I run drbdadm adjust pgsql, it locks up the system too. I
> went as far as to shut down drbd, remove the kernel module, and delete
> and recreate the sda5 partition to start over, and the system still
> locks up when I try to bring drbd up. What I'd like to do is fix the
> issue on this system and let it get back in sync with the other
> system. So, 1) how do I get drbd back and functioning on the system
> where the issue occurred, and 2) do I need to do anything to the
> system that is currently running OK (because of the partition resize,
> etc.)?
>
> Thanks,
> Doug Knight
> WSI Corp
> Andover, MA 01945
>
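
One more thought while I'm at it: to sanity-check the size mismatch fsck
reported above, I'd compare what the superblock thinks against what the
devices actually provide, something along these lines (standard
e2fsprogs/util-linux commands; I'm assuming blockdev --getsz is
available on RHEL5's util-linux):

    # filesystem size and block size according to the superblock
    dumpe2fs -h /dev/drbd0 | grep -i 'block count\|block size'
    # actual size of the DRBD device and of the backing partition, in 512-byte sectors
    blockdev --getsz /dev/drbd0
    blockdev --getsz /dev/sda5

If drbd0 reports fewer sectors than the filesystem expects, that would
match the kernel errors above.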
-------------- next part --------------
resource pgsql {
  protocol C;
  # incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;        ## Infinite!
    wfc-timeout      30;    ## 30 seconds
    degr-wfc-timeout 60;    ## 60 seconds
  }

  disk {
    on-io-error detach;
  }

  net {
    # timeout        60;
    # connect-int    10;
    # ping-int       10;
    # max-buffers    2048;
    # max-epoch-size 2048;
  }

  syncer {
    rate 120M;
    # group 1;
    al-extents 257;
  }

  on arc-dknightlx {
    device    /dev/drbd0;
    # disk    /dev/sdc5;    # pre-SAS drive install in slot SAS2
    disk      /dev/sdd5;
    address   10.4.4.4:7788;
    meta-disk internal;
  }

  on arc-swilliamslx.wsicorp.com {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.4.4.5:7788;
    meta-disk internal;
  }
}
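
One thing that strikes me about the config (and I may be completely off
here): both hosts use meta-disk internal, so DRBD keeps its metadata at
the end of the backing partition and /dev/drbd0 ends up a few megabytes
smaller than /dev/sdd5 / /dev/sda5 themselves. That metadata is mostly a
bitmap with roughly one bit per 4 KiB of storage, plus the activity log
and a small superblock, so for a device this size (~112 GiB) it works
out to something like:

    bitmap (1 bit per 4 KiB):  ~29.3 million bits / 8 = ~3.5 MiB
    gap fsck reported:         903 blocks * 4 KiB     = ~3.5 MiB

which is suspiciously close to the gap between the superblock size and
the physical size. If the filesystem was grown to the full size of the
new partition rather than to the size drbd0 actually exposes, that would
produce exactly this kind of mismatch.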