[DRBD-user] DRBD attempts to access beyond end of device

Thu Feb 28 18:33:24 CET 2008

Hello,
I have had the same problem with DRBD version 8.0.6.
I recently updated my kernel and upgraded DRBD to 8.0.11 in the hope that
this solves the problem.

Greetings.

On Thursday February 28 2008, Nate Seif wrote:
> On Thu, 28 Feb 2008, Nate Seif wrote:
> > On Wed, 27 Feb 2008, Lars Ellenberg wrote:
> >>  On Wed, Feb 27, 2008 at 01:07:04PM -0500, Nate Seif wrote:
> >> >  On Wed, 27 Feb 2008, Lars Ellenberg wrote:
> >> > >  On Tue, Feb 26, 2008 at 04:14:36PM -0500, Nate Seif wrote:
> >> > > >  Hello all:
> >> > > >  I intermittently experience the errors below while running DRBD
> >> > > >  and would like to correct whatever condition is causing DRBD to
> >> > > >  randomly lose pages. My hard disks and partitions are identical
> >> > > >  and have never given me problems previously. I don't see any
> >> > > >  other disk I/O errors in my logs. It also appears that
> >> > > >  occasionally (not always) these errors are preceded by a resync
> >> > > >  of the two disks.
> >> > > >
> >> > > >  Why would DRBD "attempt to access beyond end of device"?
> >> > > >
> >> > > >  I am running DRBD 8.0.6 on Gentoo Linux, as I could not get my
> >> > > >  latest Gentoo kernel to load the DRBD module for any version
> >> > > >  later than 8.0.6. Metadata is "internal" and I'm running
> >> > > >  Protocol C. I'd be happy to post my drbd.conf if necessary.
> >> > > >
> >> > > >
> >> > > >  Feb 26 08:21:46 <hostname> attempt to access beyond end of device
> >> > > >  Feb 26 08:21:46 <hostname> drbd0: rw=1, want=211992584, limit=211986944
> >> > > >  Feb 26 08:21:46 <hostname> Buffer I/O error on device drbd0, logical block 26499072
> >> > > >  Feb 26 08:21:46 <hostname> lost page write due to I/O error on drbd0
> >> > > >  Feb 26 08:21:46 <hostname> attempt to access beyond end of device
> >> > > >  Feb 26 08:21:46 <hostname> drbd0: rw=1, want=211992592, limit=211986944
> >> > > >  Feb 26 08:21:46 <hostname> Buffer I/O error on device drbd0, logical block 26499073
> >> > > >  Feb 26 08:21:46 <hostname> lost page write due to I/O error on drbd0
> >> > > >
> >> > > >
> >> > > >  Any ideas, tips, or help would be much appreciated. Thank you.
> >> > >
> >> > >  let me guess:
> >> > >  you did mkfs /dev/sda1, not mkfs /dev/drbd0?
> >> > >  well, you screwed up.
> >> >
> >> >  I did NOT mkfs on /dev/hda4. (I have DRBD running on a pair of
> >> > IDE/PATA disks and no SATA drives in either system.)
> >> >
> >> >  I partitioned my disks with fdisk. I have identical drives with
> >> >  identically sized partitions. I compiled the DRBD module, started
> >> > DRBD, mounted /dev/drbd0 (not /dev/hda4), and formatted drbd0 with an
> >> > ext3 file system on the primary only after I got DRBD up and running
> >> > months ago.
> >>
> >>  please do
> >>
> >>       tune2fs -l /dev/mapper/vg00--bk1-root |
> >>   grep -e ^Block.count: -e ^Block.size:
> >
> > I do not have RAID on either system and /dev/mapper does not exist on
> > either machine. I have a single, identical hard drive in each system
> > where /dev/hda4 is the partition DRBD uses. Can I change the tune2fs
> > command you suggested above to get the bytes my ext3 FS thinks it's
> > occupying?
> >
> >>  you get two numbers.
> >>  multiply those, you get the size (in bytes)
> >>  your ext3 thinks it is occupying.
> >>  which is the size of the partition you run the mkfs on, at the time of
> >>  the mkfs run, unless you used special options.
> >>
> >>  now, do
> >>   grep -e hda4 -e drbd0 /proc/partitions
> >
> > # grep -e hda4 -e drbd0 /proc/partitions
> >    3     4  105996744 hda4
> > 147     0  105993472 drbd0
> > #
> >
> >>  you again get two numbers, this time the unit is kilobytes.
> >>  that is the size of the partitions as the kernel sees them now.
> >>  according to the logs above (the limit= is unit sectors),
> >>  drbd0 will be 105993472 kB.
> >>  I dare say hda4 will be somewhat larger, my best guess, given the
> >>  information I have, is that hda4 will be 105996740 kB.
> >>  and that this also matches what the tune2fs reports.
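The size check described above can be sketched as a few lines of
arithmetic. This is only an illustration added for reference, not part of
the original exchange: it uses the /proc/partitions figures and the
want=/limit= values quoted in this thread, assumes 512-byte sectors, and
all variable names are made up for the example.

```python
SECTOR_BYTES = 512   # assumption: the kernel reports want=/limit= in 512-byte sectors
KIB = 1024           # /proc/partitions reports sizes in 1 KiB blocks


def kib_to_sectors(kib):
    """Convert a /proc/partitions size (KiB) to 512-byte sectors."""
    return kib * KIB // SECTOR_BYTES


# Figures quoted in this thread
hda4_kib = 105996744        # backing partition, from /proc/partitions
drbd0_kib = 105993472       # DRBD device, from /proc/partitions
limit_sectors = 211986944   # limit= from the kernel log
want_sectors = 211992584    # want= from the first failed write

drbd0_sectors = kib_to_sectors(drbd0_kib)
hda4_sectors = kib_to_sectors(hda4_kib)
gap_kib = hda4_kib - drbd0_kib   # space at the end of hda4 that drbd0 does not expose

print("drbd0 size matches limit=:", drbd0_sectors == limit_sectors)
print("failed write lies past drbd0 but inside hda4:",
      limit_sectors < want_sectors <= hda4_sectors)
print("KiB of hda4 not visible through drbd0:", gap_kib)
```

If the product of the two tune2fs numbers is closer to the hda4 size than
to the drbd0 size, that would support the explanation given above.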
> >
> > I imagine ext3 and the kernel need to agree on the size of the file
> > system on /dev/drbd0. If these numbers differ, then I get the errors
> > I reported, correct? Is it possible to know whether the size as ext3
> > sees it or the size as the kernel sees it is the correct one? Could a
> > corrupted inode be responsible for this problem? How do I avoid this
> > problem in the future? Can I run e2fsck on /dev/drbd0 to fix such a
> > problem?
> >
> >
> > ASIDE: I backed up my data from the primary side. Primary and secondary
> > machines went to "Primary/Unknown" and "Unknown/Secondary" after copying
> > a little less than 10 GB of data and drbd reports "NetworkFailure." All
> > NICs involved seem to be working fine - I can ping and copy files to/from
> > both computers but DRBD is disconnected.
>
> I'm not sure if this is germane to the current discussion, but I spotted
> the following error(s) in my logs on the primary computer with /dev/drbd0
> mounted for r/w access at about the same time DRBD went to a
> "NetworkFailure" status:
>
> Feb 27 17:04:11 <hostname> drbd0: PingAck did not arrive in time.
> Feb 27 17:04:11 <hostname> drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:04:11 <hostname> drbd0: Creating new current UUID
> Feb 27 17:04:11 <hostname> drbd0: asender terminated
> Feb 27 17:04:11 <hostname> drbd0: short read expecting header on sock: r=-512
> Feb 27 17:04:13 <hostname> drbd0: tl_clear()
> Feb 27 17:04:13 <hostname> drbd0: Connection closed
> Feb 27 17:04:13 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:04:13 <hostname> drbd0: conn( NetworkFailure -> Unconnected )
> Feb 27 17:04:13 <hostname> drbd0: receiver terminated
> Feb 27 17:04:13 <hostname> drbd0: receiver (re)started
> Feb 27 17:04:13 <hostname> drbd0: conn( Unconnected -> WFConnection )
> .
> .
> .
> Feb 27 17:07:56 <hostname> drbd0: conn( WFConnection -> WFReportParams )
> Feb 27 17:07:56 <hostname> drbd0: Handshake successful: DRBD Network Protocol version 86
> Feb 27 17:07:56 <hostname> drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Feb 27 17:08:06 <hostname> drbd0: PingAck did not arrive in time.
> Feb 27 17:08:06 <hostname> drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:08:06 <hostname> drbd0: asender terminated
> Feb 27 17:08:18 <hostname> drbd0: short sent ReportBitMap size=4096 sent=3800
> Feb 27 17:08:18 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:08:18 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:08:18 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:08:18 <hostname> drbd0: tl_clear()
> Feb 27 17:08:18 <hostname> drbd0: Connection closed
> Feb 27 17:08:18 <hostname> drbd0: conn( NetworkFailure -> Unconnected )
> Feb 27 17:08:18 <hostname> drbd0: receiver terminated
> Feb 27 17:08:18 <hostname> drbd0: receiver (re)started
> Feb 27 17:08:18 <hostname> drbd0: conn( Unconnected -> WFConnection )
> .
> .
> .
> Feb 27 17:09:09 <hostname> drbd0: conn( WFConnection -> WFReportParams )
> Feb 27 17:09:09 <hostname> drbd0: Handshake successful: DRBD Network Protocol version 86
> Feb 27 17:09:09 <hostname> drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Feb 27 17:09:19 <hostname> drbd0: meta connection shut down by peer.
> Feb 27 17:09:19 <hostname> drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:09:19 <hostname> drbd0: asender terminated
> Feb 27 17:09:25 <hostname> drbd0: short sent ReportBitMap size=4096 sent=1152
> Feb 27 17:09:25 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
>
> > Nate
> >
> >> >  My logs did not start recording these errors until several weeks ago.
> >>
> >>  which probably only means that the file system slowly filled up,
> >>  and now starts actually _using_ those areas which are no longer there,
> >>  because they are now occupied by the drbd meta data.
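The explanation above can be checked against the block numbers in the
logged errors. The following is a rough sketch added for reference, not
part of the original exchange. It assumes a 4096-byte ext3 block size
(never stated in the thread, but consistent with the want= values
advancing by 8 sectors per block) and 512-byte sectors, and it treats
"the file system was created at the full hda4 size" as a hypothesis, not
a confirmed fact. All names are illustrative.

```python
SECTOR_BYTES = 512
BLOCK_BYTES = 4096   # assumed ext3 block size; not stated in the thread
SECTORS_PER_BLOCK = BLOCK_BYTES // SECTOR_BYTES

limit_sectors = 211986944   # limit= from the kernel log (size of drbd0)
hda4_kib = 105996744        # size of hda4 from /proc/partitions

# (logical block, want=) pairs from the two logged write failures
logged = [(26499072, 211992584), (26499073, 211992592)]


def end_sector(block):
    """Sector just past the end of a file system block."""
    return (block + 1) * SECTORS_PER_BLOCK


# Each want= value should be the end of the reported logical block.
matches = all(end_sector(block) == want for block, want in logged)

# First file system block that no longer fits on drbd0.
first_missing_block = limit_sectors // SECTORS_PER_BLOCK

# Hypothesis: the file system spans all of hda4.
blocks_on_hda4 = hda4_kib * 1024 // BLOCK_BYTES
missing_blocks = blocks_on_hda4 - first_missing_block

print("want= values match the logged blocks:", matches)
print("first block beyond drbd0:", first_missing_block)
print("blocks lost under the hypothesis:", missing_blocks)
```

Both logged blocks lie at or past the first block beyond drbd0 and below
the hda4 block count, which is what one would expect if writes only began
to fail once the file system filled up into that last region.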
> >>
> >>  --
> >>
> >> :  Lars Ellenberg                           http://www.linbit.com :
> >> :  DRBD/HA support and consulting             sales at linbit.com :
> >> :  LINBIT Information Technologies GmbH      Tel +43-1-8178292-0  :
> >> :  Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
> >>
> >>  __
> >>  please use the "List-Reply" function of your email client.
> >>  _______________________________________________
> >>  drbd-user mailing list
> >>  drbd-user at lists.linbit.com
> >>  http://lists.linbit.com/mailman/listinfo/drbd-user
> >

-- 
---
UnlimitedMail.net - Carles Xavier Munyoz Baldó
cmunyoz at unlimitedmail.net
http://www.unlimitedmail.net/
---

---
The information contained in this e-mail is confidential and is
intended for the exclusive use of the addressee named above.
Please be advised that any use, disclosure, distribution and/or
reproduction of this communication without express authorization
is strictly prohibited under current legislation. If you have
received this message in error, please notify us immediately by
the same means and delete it.
---