[DRBD-user] Re: filesystem corruptions

Eugene Crosser crosser at rol.ru
Mon Oct 17 15:03:41 CEST 2005



Lars Ellenberg wrote:

>>>>Anyway, the 2.6.11.12 system with this
>>>>http://www.kernel.org/git/?p=linux/kernel/git/gregkh/linux-2.6.12.y.git;a=commitdiff;h=60372783e59079bdfd3ba0477e1907669249a489
>>>>patch applied, and
>>>>filesystem mounted on drbd, works for almost 24 hours now under production
>>>>load, and no filesystem errors happened so far.
>>>>
>>>
>>>[...]
>>>
>>>Thanks for this information, I was beginning to get really worried! We have
>>>now been in failover mode for a couple of days with drbd on top of software-raid1.
>>
>>Don't stop worrying!
>>An hour after we started drbd sync to the secondary, we got this:
>>
>>Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
>>ext3_readdir: bad entry in directory #89899527: inode out of bounds -
>>offset=0, inode=1728214147, rec_len=512, name_len=16
>>Oct 12 13:13:16 snfs1 kernel: Aborting journal on device drbd0.
>>Oct 12 13:13:16 snfs1 kernel: journal commit I/O error
>>Oct 12 13:13:16 snfs1 last message repeated 4 times
>>Oct 12 13:13:16 snfs1 kernel: ext3_abort called.
>>Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
>>ext3_journal_start_sb: Detected aborted journal
>>Oct 12 13:13:16 snfs1 kernel: Remounting filesystem read-only
>>Oct 12 13:13:16 snfs1 kernel: journal commit I/O error
>>Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
>>ext3_readdir: bad entry in directory #89899527: inode out of bounds -
>>offset=0, inode=1728214147, rec_len=512, name_len=16
>>Oct 12 13:13:16 snfs1 kernel: EXT3-fs error (device drbd0):
>>ext3_readdir: bad entry in directory #89899527: inode out of bounds -
>>offset=0, inode=1728214147, rec_len=512, name_len=16
>>
>>I'll be back when we have more information...
>>
>>Eugene
> 
> 
> these look familiar...
> this reads _exactly_ as a message which can be provoked with
> nfs clients.
> do this:
>  .  export something via nfs, let some clients connect.
>  .  take the network link or ip of the nfs server down.
>     now, all clients will block on uncached nfs access
>  .  take the nfs server down.
>  .  manipulate the underlying nfs structure, e.g.
>     resize the file system, or do a
>     tar cf backup.tar; mkfs; tar xf backup.tar
>  .  start serving again, and let the clients reconnect
>  they will reconnect, and they won't have stale handles, but they will
>  have wrong inodes in their requests. if these inodes are now
>  out-of-bounds -- e.g. you resized the fs, or they point to something
>  that now is not a directory, and they try to continue a readdir or
>  something like that -- you will get exactly those messages quoted above.

The course of events was like this:
- system running with filesystem mounted on drbd in WFConnection state
for a whole day, under load (nfs clients).
- secondary system brought up and told 'invalidate'.  Full sync began.
- within an hour, filesystem on the primary got errors and went r/o.
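
To line up the error with the start of the sync, it helps to pull the
numbers out of the kernel messages quoted above.  A minimal sketch in
Python; the regular expression and the function name are my own
illustration, only the message text comes from the log:

```python
import re

# Matches the ext3_readdir message from the log above.  The group names
# mirror the fields printed by the kernel; the pattern itself is an
# assumption made for illustration, not something taken from ext3.
READDIR_ERR = re.compile(
    r"bad entry in directory #(?P<dir>\d+): (?P<reason>.+?) - "
    r"offset=(?P<offset>\d+), inode=(?P<inode>\d+), "
    r"rec_len=(?P<rec_len>\d+), name_len=(?P<name_len>\d+)"
)

def parse_readdir_error(message):
    """Return the fields of an ext3_readdir error as a dict, or None."""
    m = READDIR_ERR.search(message)
    if m is None:
        return None
    # Everything except the textual reason is a decimal number.
    fields = {k: int(v) for k, v in m.groupdict().items() if k != "reason"}
    fields["reason"] = m.group("reason")
    return fields

# The wrapped syslog lines from above, joined back into one message.
sample = (
    "EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory "
    "#89899527: inode out of bounds - offset=0, inode=1728214147, "
    "rec_len=512, name_len=16"
)
print(parse_readdir_error(sample))
```

All three readdir errors in the log carry the same directory and inode
numbers, so they seem to be one bad entry hit repeatedly.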

But this reminds me of another problem in the 2.6 kernel.  The NFS *client*
apparently has a bug which shows up extremely rarely.  It manifests as
nfs files that are successfully appended to but still have the same size
afterwards, with the appended data lost.  Or as files with "holes" (parts
filled with zeroes that do not add to used blocks as per 'du').

Could it be related somehow?..
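
For what it's worth, files with such "holes" can be spotted by comparing
the apparent size with the allocated blocks, which is the same comparison
as 'ls -l' against 'du'.  A rough sketch in Python; the function names are
made up for illustration, and the 512-byte unit of st_blocks is the Linux
convention, which I assume here:

```python
import os

def is_sparse(size, blocks, unit=512):
    """True if fewer bytes are allocated than the apparent size."""
    return blocks * unit < size

def allocation(path):
    """Return (apparent_size, allocated_bytes) for a file.

    st_blocks is counted in 512-byte units on Linux; treat that unit
    as an assumption on other platforms.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512

def looks_holey(path):
    """Compare apparent size with allocation, like 'ls -l' versus 'du'."""
    size, allocated = allocation(path)
    return allocated < size
```

This is only a hint, not a proof: a file may be sparse on purpose, and
some filesystems report zero blocks for very small files.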

Eugene

P.S. To verify our earlier findings with more certainty, we have been
running the server in WFConnection state for three days.  This morning,
it continues the same way, but we have also started two pairs of netcat
over the gigabit link (in opposite directions) to put some stress on
both the Ethernet and SATA subsystems.  So far, no problem.  In a day or
two, we'll start DRBD sync and see what happens.

Eugene