[DRBD-user] Re: filesystem corruptions

Mon Oct 17 10:52:58 CEST 2005

/ 2005-10-12 15:42:04 +0400
\ Eugene Crosser:
> Bernd Schubert wrote:
> 
> >>Anyway, the 2.6.11.12 system with this
> >>http://www.kernel.org/git/?p=linux/kernel/git/gregkh/linux-2.6.12.y.git;a=commitdiff;h=60372783e59079bdfd3ba0477e1907669249a489
> >>patch applied, and
> >>filesystem mounted on drbd, works for almost 24 hours now under production
> >>load, and no filesystem errors happened so far.
> >>
> > 
> > [...]
> > 
> > Thanks for that information, I was starting to get really worried! We have
> > now been in failover mode for a couple of days with drbd on top of software-raid1.
> 
> Don't stop worrying!
> An hour after we started drbd sync to the secondary, we got this:
> 
> Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #89899527: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 12 13:13:16 snfs1 kernel: Aborting journal on device drbd0.
> Oct 12 13:13:16 snfs1 kernel: journal commit I/O error
> Oct 12 13:13:16 snfs1 last message repeated 4 times
> Oct 12 13:13:16 snfs1 kernel: ext3_abort called.
> Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_journal_start_sb: Detected aborted journal
> Oct 12 13:13:16 snfs1 kernel: Remounting filesystem read-only
> Oct 12 13:13:16 snfs1 kernel: journal commit I/O error
> Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #89899527: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #89899527: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> 
> I'll be back when we have more information...
> 
> Eugene

these look familiar...
this reads _exactly_ like a message that can be provoked with
nfs clients.
do this:
 .  export something via nfs, let some clients connect.
 .  take the network link or ip of the nfs server down.
    now all clients will block on any uncached nfs access.
 .  take the nfs server down.
 .  manipulate the file system underneath the nfs export, e.g.
    resize the file system, or do a
    tar cf backup.tar; mkfs; tar xf backup.tar
 .  start serving again, and let the clients reconnect.
 they will reconnect, and they won't have stale handles, but they will
 have wrong inodes in their requests. if these inodes are now
 out of bounds (e.g. because you resized the fs), or they point to
 something that is no longer a directory, and the clients try to continue
 a readdir or something like that, you will get exactly the messages
 quoted above.
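
 to see how much a backup/mkfs/restore cycle reshuffles inode numbers,
 one could record them before and after and compare. the following is only
 a rough sketch, not something from the drbd or nfs tools: the helper names
 snapshot_inodes and changed_inodes are made up for illustration, and it
 assumes the export is mounted locally at some path you pass in.

```python
import os


def snapshot_inodes(root):
    """Map each path under root (relative to root) to its inode number."""
    inodes = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            # lstat: record the inode of a symlink itself, not of its target
            inodes[rel] = os.lstat(full).st_ino
    return inodes


def changed_inodes(before, after):
    """Return {path: (old_inode, new_inode)} for paths present in both
    snapshots whose inode number differs."""
    return {
        path: (before[path], after[path])
        for path in before
        if path in after and before[path] != after[path]
    }
```

 take one snapshot before the backup and one after the restore. every path
 reported by changed_inodes is a place where a client still holding the old
 inode number would now be asking for a different object, or for nothing
 at all.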

-- 
: Lars Ellenberg                                  Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH            Fax +43-1-8178292-82 :
: Schoenbrunner Str. 244, A-1120 Vienna/Europe   http://www.linbit.com :