Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
> P.S. To verify our earlier findings with more certainty, we have been
> running the server in WFConnection state for three days. This morning,
> it continues the same way but we have also started two pairs of netcat
> over the gigabit link (in opposite direction) to give some stress to
> both Ethernet and SATA subsystems. So far, no problem. In a day or
> two, we'll start DRBD sync and see what happens.

OK, here is some reasonably certain information:

On the 13th, we mounted the freshly checked filesystem from drbd, in
WFConnection state, and put it in production (NFS exported). Until the
17th, there was network activity from NFS clients on one 100Mbit
interface, from Legato backup on another 100Mbit interface, and no
activity on the crossover Gbit interface.

On the 17th, we started two pairs of netcat pipes in opposite directions
over the Gbit crossover interface, copying from /dev/zero on one machine
to /dev/null on the other. It was running this way until today (the 19th)
and there were no problems.
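Each netcat pair was essentially a zero-to-null pipe over the crossover
link; a minimal sketch of one direction (the port number, host name and
netcat-traditional option syntax are illustrative assumptions, not the
exact commands we ran):

  # on the receiving node: listen and discard everything
  nodeB# nc -l -p 3000 > /dev/null

  # on the sending node: stream zeroes across the Gbit crossover interface
  nodeA# nc nodeB-gbit 3000 < /dev/zero

The second pair is the mirror image, with nodeA listening and nodeB
sending, so both directions of the link carry traffic at the same time.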
Today, we killed the netcats and started drbd on the peer node, so it
began to sync. The filesystem was *not* remounted, the NFS server was
*not* restarted.

Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFConnection --> WFReportParams
Oct 19 11:55:54 snfs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Oct 19 11:55:54 snfs1 kernel: drbd0: Connection established.
Oct 19 11:55:54 snfs1 kernel: drbd0: I am(P): 1:00000002:00000003:00000004:00000005:10
Oct 19 11:55:54 snfs1 kernel: drbd0: Peer(S): 0:00000002:00000002:00000003:00000004:01
Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFReportParams --> WFBitMapS
Oct 19 11:55:55 snfs1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Oct 19 11:55:55 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFBitMapS --> SyncSource
Oct 19 11:55:55 snfs1 kernel: drbd0: Resync started as SyncSource (need to sync 1741484304 KB [435371076 bits set]).

In about an hour, the traditional trouble happened:

Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #96174334: rec_len is smaller than minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
Oct 19 12:50:58 snfs1 kernel: Aborting journal on device drbd0.
Oct 19 12:50:58 snfs1 kernel: journal commit I/O error
Oct 19 12:50:58 snfs1 last message repeated 4 times
Oct 19 12:50:58 snfs1 kernel: ext3_abort called.
Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_journal_start_sb: Detected aborted journal
Oct 19 12:50:58 snfs1 kernel: Remounting filesystem read-only
Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #96174334: rec_len is smaller than minimal - offset=0, inode=1884609311, rec_len=0, name_len=0

At this moment, we stopped the NFS server, did *not* stop the drbd sync,
unmounted the filesystem, and ran

# blockdev --flushbufs /dev/drbd0
# blockdev --flushbufs /dev/md3

then mounted the filesystem back:

Oct 19 12:53:18 snfs1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended

and ran "ls -lR" on the filesystem. Instantly, a similar ext3 error
manifested itself:

Oct 19 12:54:22 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #38766129: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16
Oct 19 12:54:22 snfs1 kernel: Aborting journal on device drbd0.
Oct 19 12:54:24 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #39469971: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16
Oct 19 12:54:26 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #38766116: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16

Now, we stopped drbd on the peer:

Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_worker [1437]: cstate SyncSource --> NetworkFailure
Oct 19 12:59:56 snfs1 kernel: drbd0: asender terminated
Oct 19 12:59:56 snfs1 kernel: drbd0: worker terminated
Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate NetworkFailure --> Unconnected
Oct 19 12:59:56 snfs1 kernel: drbd0: Connection lost.
Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate Unconnected --> WFConnection

unmounted the filesystem and then ran fsck on /dev/drbd0. No errors were
found. Then we started the NFS server, and it has been working fine since.
We also ran "ls -lR" again, and it did not trigger any problems.

So, what we know so far:

- Filesystem errors are in in-memory structures only, not on disk.
- They are not related to NFS (they show up on a local "ls -lR").
- They are not related to Ethernet activity per se.
- They are triggered by running drbd sync (and only that).

I would suggest that reading blocks from drbd sometimes yields wrong data
*if* the data is in the process of being synced to the peer.

My further plans: try to reproduce the problem in a testing environment
(another block device on the same hosts), and find out whether it makes
any difference when drbd runs on top of md or on top of raw disk
partitions.

Other suggestions, please?

Eugene
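A minimal sketch of the kind of check the reproduction plan above could
use, assuming a separate test filesystem on the test drbd device; the
mount point and file names are illustrative, not our actual setup:

  # 1. With the peer disconnected, take a baseline checksum of every file
  #    on the mounted test filesystem.
  # cd /mnt/drbdtest && find . -type f | sort | xargs md5sum > /tmp/baseline.md5

  # 2. Bring drbd up on the peer so a full resync starts and this node
  #    becomes SyncSource, as in the logs above.

  # 3. While the resync is running, re-read everything and compare; any
  #    difference means reads returned wrong data during the sync.
  # cd /mnt/drbdtest && find . -type f | sort | xargs md5sum > /tmp/during-sync.md5
  # diff /tmp/baseline.md5 /tmp/during-sync.md5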