[DRBD-user] DRBD crash with "attempt to write beyond end of device"

Lars Ellenberg lars.ellenberg at linbit.com
Fri Feb 8 20:13:39 CET 2008


it is not drbd that crashes here, it is the file system that fails,
because you screwed up and effectively truncated it.

On Fri, Feb 08, 2008 at 10:19:49AM -0500, Doug Knight wrote:
> I'm really getting desperate on this, as we are currently not in a high
> availability state with our server, so I thought I'd include some more
> info. Attached is my drbd.conf. Also, I am running RHEL5
> 2.6.18-8.1.14.el5 on both systems. Below is a capture from my system
> messages log from the original failure:

if I understand correctly, what you did is
 1) have some partition, with drbd and internal meta data on it,
    and a happy file system on drbd.
 2) stop drbd (after first getting it into Connected, Secondary/Secondary)
 3) use parted to resize the partition
 3.1) which also resized the file system on that partition
 4) create new internal drbd meta data
 5) start drbd again
 6) try to use the now resized file system on drbd, which fails

if that was indeed what was happening,
you screwed up in 3.1), or at the latest in 4).
see below.

if that description does not at all match what you did,
please ignore the rest and describe exactly what you did.

> > Hi list,
> > I had one of my HA systems, running drbd 8.0.1, issue an error on its

and we are several releases past 8.0.1 by now, so please upgrade.

> > drbd0 device (see title). We recently resized the underlying partition
> > using gparted to include the partition immediately following it
> > (verified that the new, larger partitions were identical, and ran the
> > command to fix the meta-data, suggested when drbd was restarted). We
> > did this on both systems, and everything seemed OK for a few days.
> > This morning we got the error, heartbeat detected it, and migrated
> > resources to the other system, no problem. I took drbd down on both
> > systems, mounted and set primary drbd0 on the system with the issue,
> > and did an fsck -fvn /dev/drbd0 on it (unmounted). I get the
> > following:
> > 
> > The filesystem size (according to the superblock) is 29288495 blocks
> > The physical size of the device is 29287592 blocks
> > Either the superblock or the partition table is likely to be corrupt!

the fs resize in 3.1), not knowing that you need to keep a few MB unused
at the end of the device for the drbd meta data, resized the fs
to use up the full partition.

when you created the new internal meta data in 4),
drbd took those last few MB for itself, so when you now bring drbd up,
the "drbd partition" is exactly those few MB smaller than the lower
level "real partition", and the file system extends beyond the end of
the drbd device.
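
as a rough cross-check of the numbers in the fsck output quoted above,
here is a small sketch (not drbd's own code). assumptions: the file
system uses 4 KiB blocks, and the internal meta data size follows the
estimate given in the DRBD User's Guide, ceil(Cs / 2^18) * 8 + 72
sectors for a device of Cs 512-byte sectors. all variable names are
made up for illustration.

```shell
#!/bin/sh
# numbers taken from the fsck output quoted above
fs_blocks=29288495      # what the superblock claims
dev_blocks=29287592     # what /dev/drbd0 really offers

# ASSUMPTION: 4 KiB file system blocks, i.e. 8 sectors of 512 bytes each
sectors_per_block=8

# how far the file system reaches beyond the end of the drbd device
missing_blocks=$((fs_blocks - dev_blocks))
missing_kib=$((missing_blocks * sectors_per_block / 2))

# estimate of the internal meta data size, per the DRBD User's Guide:
#   Ms = ceil(Cs / 2^18) * 8 + 72   (all in 512-byte sectors)
# Cs is taken as the size of the lower level partition, which the
# resized file system fills completely.
cs=$((fs_blocks * sectors_per_block))
md_sectors=$(( (cs + 262143) / 262144 * 8 + 72 ))
md_blocks=$((md_sectors / sectors_per_block))

echo "fs extends $missing_blocks blocks ($missing_kib KiB) beyond the device"
echo "estimated internal meta data: $md_sectors sectors = $md_blocks blocks"
```

under these assumptions both figures come out at 903 blocks (about
3.5 MB), which is what you would expect if the missing space is exactly
the internal meta data.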

what you should have done is
 either:
   in 4) use DRBD external meta data,
   which would have worked just fine.
 or:
   1,2,3 BUT NOT 3.1 (regardless of whether parted did the fs resize,
   or you did it yourself)
   repeat 1,2,3 on the other node (still _without_ the fs resize),
   create the new internal drbd meta data on both nodes
   connect the drbds
   choose one node (preferably the one that had been primary last)
   make that primary again, using the "overwrite data of peer" thing.
   wait for the resync to happen.

   now you still have the file system in the old size.
   but you can verify that your DRBD is indeed the new size (minus
   whatever drbd needs for its internal meta data).
   so, after you verified that, you do the file system resize on the
   _drbd_ NOT on the lower level partition.
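
the second variant above could look roughly like the sketch below. it
starts after the partition (and only the partition) has been resized on
both nodes. assumptions: the resource is called "pgsql" and the device
is /dev/drbd0, as in the original report; the file system is ext2/ext3;
run(), both_nodes(), one_node() and DRY_RUN are helpers made up for this
example. by default it only prints the commands. it is not meant to be
run unattended, since you have to wait for the resync in the middle.

```shell
#!/bin/sh
# ASSUMPTIONS: resource "pgsql" and /dev/drbd0 as in the original report;
# an ext2/ext3 file system; run() and DRY_RUN are helpers made up here.
RES=${RES:-pgsql}
DRBD_DEV=${DRBD_DEV:-/dev/drbd0}
DRY_RUN=${DRY_RUN:-1}          # 1 = only print the commands

run() {
    echo "+ $*"
    if [ "$DRY_RUN" = 0 ]; then "$@"; fi
}

# on BOTH nodes, with drbd stopped and the partition already resized:
both_nodes() {
    run drbdadm create-md "$RES"     # new internal meta data at the new end
    run drbdadm up "$RES"            # attach and connect
}

# on ONE node only, preferably the one that was primary last:
one_node() {
    run drbdadm -- --overwrite-data-of-peer primary "$RES"
    run cat /proc/drbd                       # repeat until the resync is done
    run blockdev --getsize64 "$DRBD_DEV"     # verify the new drbd size
    run e2fsck -f "$DRBD_DEV"                # file system still unmounted
    run resize2fs "$DRBD_DEV"                # grow the fs on drbd, not on sdX
}

both_nodes
one_node
```

the point of the ordering is that resize2fs only ever sees the drbd
device, so it cannot grow the file system into the area that drbd uses
for its meta data.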

 
> > So, I then ran fsck without the -n to correct. Now, drbd seems to be
> > completely hosed up. If I do a ./drbd start, the system locks up. If I
> > do the drbdadm adjust pgsql, it locks up the system too. I went as far
> > as to shutdown drbd, remove the kernel module, delete the sda5
> > partition and recreate it, starting over, and it still locks up the
> > system when I try to bring up drbd. What I'd like to do is fix the
> > issue on this system, and let it get back in sync with the other
> > system. So 1) How do I get drbd back and functioning on the system
> > where the issue occurred?, and 2) Do I need to do anything to the
> > system that is currently running OK (due to the partition resize,
> > etc)?

possible way out:

stop drbd completely.
fsck /dev/sd-whatever-it-is
resize2fs /dev/sd-whatever-it-is THE_SMALLER_SIZE_which_is_the_real_size_of_the_drbd
hope for the best
start drbd again,
mount drbd
compare with backups.

or
umount /dev/drbd
mkfs /dev/drbd
restore from backup.
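
for the first way out, the size argument of resize2fs needs some care:
without a unit suffix, resize2fs reads it as a number of file system
blocks. the sketch below derives that number from the size of the drbd
device. assumptions: 4 KiB file system blocks; the byte count is
reconstructed from the "physical size of the device" in the fsck output
(with drbd up you would read it with blockdev --getsize64 /dev/drbd0);
target_blocks() and LOWER_DEV are names made up here, and LOWER_DEV
stays a placeholder for your real lower level partition. the commands
are printed, not executed.

```shell
#!/bin/sh
# ASSUMPTIONS: LOWER_DEV is a placeholder for the real lower level partition;
# 4 KiB file system blocks; target_blocks() is a helper made up here.

# whole file system blocks that fit on a device of the given size in bytes
target_blocks() {
    echo $(( $1 / $2 ))
}

# "physical size of the device" from the fsck output, converted to bytes,
# as `blockdev --getsize64 /dev/drbd0` would report it while drbd was up
drbd_bytes=$((29287592 * 4096))
new_size=$(target_blocks "$drbd_bytes" 4096)

LOWER_DEV=${LOWER_DEV:-/dev/sd-whatever-it-is}
# printed rather than executed; drbd must be stopped before running these
echo "e2fsck -f $LOWER_DEV"
echo "resize2fs $LOWER_DEV $new_size"
```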


-- 
: Lars Ellenberg                           http://www.linbit.com :
: DRBD/HA support and consulting             sales at linbit.com :
: LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
: Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
__
please use the "List-Reply" function of your email client.


