Hello,
I have had the same problem with DRBD version 8.0.6.
I have recently updated my kernel and DRBD version to 8.0.11 in the hope
that it solves the problem.

Greetings.

On Thursday February 28 2008, Nate Seif wrote:
> On Thu, 28 Feb 2008, Nate Seif wrote:
> > On Wed, 27 Feb 2008, Lars Ellenberg wrote:
> >> On Wed, Feb 27, 2008 at 01:07:04PM -0500, Nate Seif wrote:
> >> > On Wed, 27 Feb 2008, Lars Ellenberg wrote:
> >> > > On Tue, Feb 26, 2008 at 04:14:36PM -0500, Nate Seif wrote:
> >> > > > Hello all:
> >> > > > I intermittently experience the errors below while running DRBD
> >> > > > and would like to correct whatever condition is causing DRBD to
> >> > > > randomly lose pages. My hard disks and partitions are identical
> >> > > > and have never given me problems previously. I don't see any
> >> > > > other disk I/O errors in my logs. And it appears that
> >> > > > occasionally (not always) these errors are preceded by a resync
> >> > > > of the two disks.
> >> > > >
> >> > > > Why would DRBD "attempt to access beyond end of device"?
> >> > > >
> >> > > > I am running DRBD 8.06 on Gentoo Linux as I could not get my
> >> > > > latest Gentoo kernel to load the DRBD module where version > 8.06.
> >> > > > Metadata is "internal" and I'm running Protocol C. I'd be happy
> >> > > > to post my drbd.conf page if necessary.
> >> > > >
> >> > > >
> >> > > > Feb 26 08:21:46 <hostname> attempt to access beyond end of device
> >> > > > Feb 26 08:21:46 <hostname> drbd0: rw=1, want=211992584, limit=211986944
> >> > > > Feb 26 08:21:46 <hostname> Buffer I/O error on device drbd0, logical block 26499072
> >> > > > Feb 26 08:21:46 <hostname> lost page write due to I/O error on drbd0
> >> > > > Feb 26 08:21:46 <hostname> attempt to access beyond end of device
> >> > > > Feb 26 08:21:46 <hostname> drbd0: rw=1, want=211992592, limit=211986944
> >> > > > Feb 26 08:21:46 <hostname> Buffer I/O error on device drbd0, logical block 26499073
> >> > > > Feb 26 08:21:46 <hostname> lost page write due to I/O error on drbd0
> >> > > >
> >> > > >
> >> > > > Any ideas, tips, help, etc. is much appreciated. Thank you -
> >> > >
> >> > > let me guess:
> >> > > you did mkfs /dev/sda1, not mkfs /dev/drbd0?
> >> > > well, you screwed up.
> >> >
> >> > I did NOT mkfs on /dev/hda4. (I have DRBD running on a pair of
> >> > IDE/PATA disks and no SATA drives in either system.)
> >> >
> >> > I partitioned my disks with fdisk. I have identical drives with
> >> > identically sized partitions. I compiled the DRBD module, started
> >> > DRBD, mounted /dev/drbd0 (not /dev/hda4), and formatted drbd0 with an
> >> > ext3 file system on the primary only after I got DRBD up and running
> >> > months ago.
> >>
> >> please do
> >>
> >> tune2fs -l /dev/mapper/vg00--bk1-root |
> >> grep -e ^Block.count: -e ^Block.size:
> >
> > I do not have RAID on either system and /dev/mapper does not exist on
> > either machine. I have a single, identical hard drive in each system
> > where /dev/hda4 is the partition DRBD uses. Can I change the tune2fs
> > command you suggested above to get the bytes my ext3 FS thinks it's
> > occupying?
> >
> >> you get two numbers.
> >> multiply those, you get the size (in bytes)
> >> your ext3 thinks it is occupying.
> >> which is the size of the partition you run the mkfs on, at the time of
> >> the mkfs run, unless you used special options.
> >>
> >> now, do
> >> grep -e hda4 -e drbd0 /proc/partitions
> >
> > # grep -e hda4 -e drbd0 /proc/partitions
> > 3 4 105996744 hda4
> > 147 0 105993472 drbd0
> > #
> >
> >> you again get two numbers, this time unit is kilo byte.
> >> that is the size of the partitions as the kernel sees them now.
> >> according to the logs above (the limit= is unit sectors),
> >> drbd0 will be 105993472 kB.
> >> I dare say hda4 will be somewhat larger, my best guess, given the
> >> information I have, is that hda4 will be 105996740 kB.
> >> and that this also matches what the tune2fs reports.
> >
> > I imagine ext3 and the kernel need to have or use the same number for
> > file system size on /dev/drbd0. If these numbers differ, then I get the
> > errors I reported, correct? How (or is it possible?) to know whether the
> > size as ext3 sees it is correct or the kernel size is correct? Could a
> > corrupted inode be responsible for this problem? How do I avoid this
> > problem in the future? Can I run e2fsck on /dev/drbd0 to fix such a
> > problem?
> >
> >
> > ASIDE: I backed up my data from the primary side. Primary and secondary
> > machines went to "Primary/Unknown" and "Unknown/Secondary" after copying
> > a little less than 10 GB of data and drbd reports "NetworkFailure." All
> > NICs involved seem to be working fine - I can ping and copy files to/from
> > both computers but DRBD is disconnected.
>
> I'm not sure if this is germane to the current discussion, but I spotted
> the following error(s) in my logs on the primary computer with /dev/drbd0
> mounted for r/w access at about the same time DRBD went to a
> "NetworkFailure" status:
>
> Feb 27 17:04:11 <hostname> drbd0: PingAck did not arrive in time.
> Feb 27 17:04:11 <hostname> drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:04:11 <hostname> drbd0: Creating new current UUID
> Feb 27 17:04:11 <hostname> drbd0: asender terminated
> Feb 27 17:04:11 <hostname> drbd0: short read expecting header on sock: r=-512
> Feb 27 17:04:13 <hostname> drbd0: tl_clear()
> Feb 27 17:04:13 <hostname> drbd0: Connection closed
> Feb 27 17:04:13 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:04:13 <hostname> drbd0: conn( NetworkFailure -> Unconnected )
> Feb 27 17:04:13 <hostname> drbd0: receiver terminated
> Feb 27 17:04:13 <hostname> drbd0: receiver (re)started
> Feb 27 17:04:13 <hostname> drbd0: conn( Unconnected -> WFConnection )
> .
> .
> .
> Feb 27 17:07:56 <hostname> drbd0: conn( WFConnection -> WFReportParams )
> Feb 27 17:07:56 <hostname> drbd0: Handshake successful: DRBD Network Protocol version 86
> Feb 27 17:07:56 <hostname> drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Feb 27 17:08:06 <hostname> drbd0: PingAck did not arrive in time.
> Feb 27 17:08:06 <hostname> drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:08:06 <hostname> drbd0: asender terminated
> Feb 27 17:08:18 <hostname> drbd0: short sent ReportBitMap size=4096 sent=3800
> Feb 27 17:08:18 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:08:18 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:08:18 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:08:18 <hostname> drbd0: tl_clear()
> Feb 27 17:08:18 <hostname> drbd0: Connection closed
> Feb 27 17:08:18 <hostname> drbd0: conn( NetworkFailure -> Unconnected )
> Feb 27 17:08:18 <hostname> drbd0: receiver terminated
> Feb 27 17:08:18 <hostname> drbd0: receiver (re)started
> Feb 27 17:08:18 <hostname> drbd0: conn( Unconnected -> WFConnection )
> .
> .
> .
> Feb 27 17:09:09 <hostname> drbd0: conn( WFConnection -> WFReportParams )
> Feb 27 17:09:09 <hostname> drbd0: Handshake successful: DRBD Network Protocol version 86
> Feb 27 17:09:09 <hostname> drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
> Feb 27 17:09:19 <hostname> drbd0: meta connection shut down by peer.
> Feb 27 17:09:19 <hostname> drbd0: peer( Secondary -> Unknown ) conn( WFBitMapS -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Feb 27 17:09:19 <hostname> drbd0: asender terminated
> Feb 27 17:09:25 <hostname> drbd0: short sent ReportBitMap size=4096 sent=1152
> Feb 27 17:09:25 <hostname> drbd0: Writing meta data super block now.
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> Feb 27 17:09:25 <hostname> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
>
>
> Nate
>
>
> >> > My logs did not start recording these errors until several weeks ago.
> >>
> >> which probably only means that the file system slowly filled up,
> >> and now starts actually _using_ those areas which are no longer there,
> >> because they are now occupied by the drbd meta data.
> >>
> >> --
> >>
> >> : Lars Ellenberg                           http://www.linbit.com :
> >> : DRBD/HA support and consulting             sales at linbit.com :
> >> : LINBIT Information Technologies GmbH       Tel +43-1-8178292-0 :
> >> : Vivenotgasse 48, A-1120 Vienna/Europe     Fax +43-1-8178292-82 :
> >>
> >> __
> >> please use the "List-Reply" function of your email client.
> >> _______________________________________________
> >> drbd-user mailing list
> >> drbd-user at lists.linbit.com
> >> http://lists.linbit.com/mailman/listinfo/drbd-user

--
---
UnlimitedMail.net - Carles Xavier Munyoz Baldó
cmunyoz at unlimitedmail.net
http://www.unlimitedmail.net/
---
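
The size comparison Lars walks through in the quoted thread can be checked with plain shell arithmetic. The sketch below is only an illustration: it uses the numbers posted in the thread (the want= and limit= values from the kernel log, and the two sizes from /proc/partitions), the variable names are made up for this example, and it does not touch any device.

```shell
#!/bin/sh
# Values copied from the thread; variable names are illustrative only.
limit_sectors=211986944   # "limit=" from the kernel log, in 512-byte sectors
want_sectors=211992584    # "want=" from the first failed write
hda4_kb=105996744         # hda4 size from /proc/partitions, in kB
drbd0_kb=105993472        # drbd0 size from /proc/partitions, in kB

# Two 512-byte sectors make one kB.
limit_kb=$((limit_sectors / 2))
want_kb=$((want_sectors / 2))

echo "drbd0 limit from log: ${limit_kb} kB"
echo "end of failed write:  ${want_kb} kB"
echo "hda4 minus drbd0:     $((hda4_kb - drbd0_kb)) kB"

# A write that ends beyond drbd0 but still within hda4 is what one would
# expect if the file system had been sized to the backing partition.
if [ "$want_kb" -gt "$drbd0_kb" ] && [ "$want_kb" -le "$hda4_kb" ]; then
    verdict="write falls past drbd0 but inside hda4"
else
    verdict="write is not explained by the size difference"
fi
echo "$verdict"
```

With the posted numbers, the limit converts to exactly the drbd0 size shown in /proc/partitions, and the failed write ends in the 3272 kB gap between drbd0 and hda4. That is consistent with Lars's reading that the file system is larger than the DRBD device. To get the file system's own idea of its size, the tune2fs command from the thread would presumably be pointed at the device that was formatted (here /dev/drbd0); that substitution is an assumption, since the thread does not show it being run.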