Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wednesday, 19 October 2005 14:50, Eugene Crosser wrote:
> > P.S. To verify our earlier findings with more certainty, we have been
> > running the server in WFConnection state for three days. This morning,
> > it continues the same way, but we have also started two pairs of netcat
> > over the gigabit link (in opposite directions) to give some stress to
> > both the Ethernet and SATA subsystems. So far, no problem. In a day or
> > two, we'll start the DRBD sync and see what happens.
>
> OK, here is some reasonably certain information:
>
> On the 13th, we mounted a freshly checked filesystem from drbd, in
> WFConnection state, and put it into production (NFS exported). Until
> the 17th, there was network activity from NFS clients on one 100Mbit
> interface, from Legato backup on another 100Mbit interface, and no
> activity on the crossover Gbit interface.
>
> On the 17th, we started two pairs of netcat pipes in opposite directions
> over the Gbit crossover interface, copying from /dev/zero on one machine
> to /dev/null on the other. It ran this way until today (the 19th) and
> there were no problems.
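
Note: a pair of netcat pipes of the kind described above can be set up
roughly as follows. This is only a sketch; the port numbers and host
names are invented here, and the -l -p syntax assumes the traditional
netcat (the OpenBSD variant takes the listen port without -p).

  # nc -l -p 5001 > /dev/null &        (listener on node A)
  # nc nodeA-gbit 5001 < /dev/zero &   (sender on node B)
  # nc -l -p 5002 > /dev/null &        (listener on node B)
  # nc nodeB-gbit 5002 < /dev/zero &   (sender on node A)

Each pair pushes zeros over the Gbit crossover link in one direction, so
running both pairs stresses the link in both directions at once.
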
> Today, we killed the netcats and started drbd on the peer node, so it
> began to sync. The filesystem was *not* remounted, and the NFS server
> was *not* restarted.
>
> Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFConnection --> WFReportParams
> Oct 19 11:55:54 snfs1 kernel: drbd0: Handshake successful: DRBD Network
> Protocol version 74
> Oct 19 11:55:54 snfs1 kernel: drbd0: Connection established.
> Oct 19 11:55:54 snfs1 kernel: drbd0: I am(P):
> 1:00000002:00000003:00000004:00000005:10
> Oct 19 11:55:54 snfs1 kernel: drbd0: Peer(S):
> 0:00000002:00000002:00000003:00000004:01
> Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFReportParams --> WFBitMapS
> Oct 19 11:55:55 snfs1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
> Oct 19 11:55:55 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFBitMapS --> SyncSource
> Oct 19 11:55:55 snfs1 kernel: drbd0: Resync started as SyncSource (need
> to sync 1741484304 KB [435371076 bits set]).
>
> In about an hour, the traditional trouble happened:
>
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #96174334: rec_len is smaller than
> minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
> Oct 19 12:50:58 snfs1 kernel: Aborting journal on device drbd0.
> Oct 19 12:50:58 snfs1 kernel: journal commit I/O error
> Oct 19 12:50:58 snfs1 last message repeated 4 times
> Oct 19 12:50:58 snfs1 kernel: ext3_abort called.
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_journal_start_sb: Detected aborted journal
> Oct 19 12:50:58 snfs1 kernel: Remounting filesystem read-only
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #96174334: rec_len is smaller than
> minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
>
> At this moment, we stopped the NFS server, did *not* stop the drbd sync,
> unmounted the filesystem, ran
> # blockdev --flushbufs /dev/drbd0
> # blockdev --flushbufs /dev/md3
> and mounted the filesystem back:
>
> Oct 19 12:53:18 snfs1 kernel: EXT3-fs warning: mounting fs with errors,
> running e2fsck is recommended
>
> and ran "ls -lR" on the filesystem. Instantly, a similar ext3 error
> manifested itself!
>
> Oct 19 12:54:22 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #38766129: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 19 12:54:22 snfs1 kernel: Aborting journal on device drbd0.
> Oct 19 12:54:24 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #39469971: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 19 12:54:26 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #38766116: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
>
> Now, we stopped drbd on the peer:
>
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_worker [1437]: cstate
> SyncSource --> NetworkFailure
> Oct 19 12:59:56 snfs1 kernel: drbd0: asender terminated
> Oct 19 12:59:56 snfs1 kernel: drbd0: worker terminated
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> NetworkFailure --> Unconnected
> Oct 19 12:59:56 snfs1 kernel: drbd0: Connection lost.
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> Unconnected --> WFConnection
>
> unmounted the filesystem, and then ran fsck on /dev/drbd0. No errors
> were found. Then we started the NFS server, and it has worked fine since.
> We also ran "ls -lR" again, and it did not trigger any problems.
>
> So, what we know so far:
> The filesystem errors are in in-memory structures only, not on disk.
> They are not related to NFS (they show up on a local "ls -lR").
> They are not related to Ethernet activity per se.
> They are triggered by running drbd sync (and only that).
>
> I would suggest that reading blocks from drbd sometimes yields wrong
> data *if* that data is in the process of being synced to the peer.
>
> My further plans: try to reproduce the problem in a testing environment
> (another block device on the same hosts), and find out whether it makes
> any difference when you run drbd on top of md or on top of raw disk
> partitions. Other suggestions, please?
>
> Eugene

Hi Eugene,

I guess I will find some time on Friday to look into this issue, and I
will be happy about every bit of information that you have collected by
then.

-phil
--
: Dipl-Ing Philipp Reisner                    Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH        Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria   http://www.linbit.com :
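
Note: a minimal version of the reproduction Eugene plans above might look
roughly like the following. This is a sketch only; the resource name (r1),
the devices (/dev/drbd1 backed by a spare partition), and the mount point
are assumptions for illustration, and the exact drbdadm invocations may
differ between drbd versions.

  # drbdadm up r1                (on both nodes)
  # drbdadm primary r1           (on the test node)
  # mkfs.ext3 /dev/drbd1
  # mount /dev/drbd1 /mnt/test
    ... populate /mnt/test with a large directory tree ...
  # drbdadm disconnect r1        (let some writes accumulate unsynced)
    ... write more files ...
  # drbdadm connect r1           (resync starts)
  # ls -lR /mnt/test             (watch the kernel log for EXT3-fs
                                  errors while the resync is running)

To compare md against raw partitions, the same resource could be backed
once by an md device and once by a plain disk partition.
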