Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
> P.S. To verify our earlier findings with more certainty, we have been
> running the server in WFConnection state for three days. This morning,
> it continues the same way but we have also started two pairs of netcat
> over the gigabit link (in opposite direction) to give some stress to
> both Ethernet and SATA subsystems. So far, no problem. In a day or
> two, we'll start DRBD sync and see what happens.

OK, here is some reasonably certain information:

On the 13th, we mounted the freshly checked filesystem from drbd, in
WFConnection state, and put it in production (NFS exported). Until the
17th, there was network activity from NFS clients on one 100Mbit
interface, from Legato backup on another 100Mbit interface, and no
activity on the crossover Gbit interface.

On the 17th, we started two pairs of netcat pipes in opposite directions
over the Gbit crossover interface, copying from /dev/zero on one machine
to /dev/null on the other. It was running this way until today (the 19th)
and there were no problems.
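Each netcat pair was essentially a zero-to-null pipe over the crossover
link; a minimal sketch of one direction (the port number, host name and
netcat-traditional option syntax are illustrative assumptions, not the
exact commands we ran):

  # on the receiving node: listen and discard everything
  nodeB# nc -l -p 3000 > /dev/null

  # on the sending node: stream zeroes across the Gbit crossover interface
  nodeA# nc nodeB-gbit 3000 < /dev/zero

The second pair is the mirror image, with nodeA listening and nodeB
sending, so both directions of the link carry traffic at the same time.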
Today, we killed the netcats and started drbd on the peer node, so it
began to sync. The filesystem was *not* remounted, the NFS server was
*not* restarted.

Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFConnection --> WFReportParams
Oct 19 11:55:54 snfs1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 74
Oct 19 11:55:54 snfs1 kernel: drbd0: Connection established.
Oct 19 11:55:54 snfs1 kernel: drbd0: I am(P): 1:00000002:00000003:00000004:00000005:10
Oct 19 11:55:54 snfs1 kernel: drbd0: Peer(S): 0:00000002:00000002:00000003:00000004:01
Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFReportParams --> WFBitMapS
Oct 19 11:55:55 snfs1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
Oct 19 11:55:55 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate WFBitMapS --> SyncSource
Oct 19 11:55:55 snfs1 kernel: drbd0: Resync started as SyncSource (need to sync 1741484304 KB [435371076 bits set]).

In about an hour, the traditional trouble happened:

Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #96174334: rec_len is smaller than minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
Oct 19 12:50:58 snfs1 kernel: Aborting journal on device drbd0.
Oct 19 12:50:58 snfs1 kernel: journal commit I/O error
Oct 19 12:50:58 snfs1 last message repeated 4 times
Oct 19 12:50:58 snfs1 kernel: ext3_abort called.
Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_journal_start_sb: Detected aborted journal
Oct 19 12:50:58 snfs1 kernel: Remounting filesystem read-only
Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #96174334: rec_len is smaller than minimal - offset=0, inode=1884609311, rec_len=0, name_len=0

At this moment, we stopped the NFS server, did *not* stop the drbd sync,
unmounted the filesystem, and ran

# blockdev --flushbufs /dev/drbd0
# blockdev --flushbufs /dev/md3

then mounted the filesystem back:

Oct 19 12:53:18 snfs1 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended

and ran "ls -lR" on the filesystem. Instantly, a similar ext3 error
manifested itself:

Oct 19 12:54:22 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #38766129: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16
Oct 19 12:54:22 snfs1 kernel: Aborting journal on device drbd0.
Oct 19 12:54:24 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #39469971: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16
Oct 19 12:54:26 snfs1 kernel: EXT3-fs error (device drbd0): ext3_readdir: bad entry in directory #38766116: inode out of bounds - offset=0, inode=1728214147, rec_len=512, name_len=16

Now, we stopped drbd on the peer:

Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_worker [1437]: cstate SyncSource --> NetworkFailure
Oct 19 12:59:56 snfs1 kernel: drbd0: asender terminated
Oct 19 12:59:56 snfs1 kernel: drbd0: worker terminated
Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate NetworkFailure --> Unconnected
Oct 19 12:59:56 snfs1 kernel: drbd0: Connection lost.
Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate Unconnected --> WFConnection

unmounted the filesystem and then ran fsck on /dev/drbd0. No errors were
found. Then we started the NFS server, and it has been working fine since.
We also ran "ls -lR" again, and it did not trigger any problems.

So, what we know so far:

- Filesystem errors are in in-memory structures only, not on disk.
- They are not related to NFS (they show up on a local "ls -lR").
- They are not related to Ethernet activity per se.
- They are triggered by running drbd sync (and only that).

I would suggest that reading blocks from drbd sometimes yields wrong data
*if* the data is in the process of being synced to the peer.

My further plans: try to reproduce the problem in a testing environment
(another block device on the same hosts), and find out whether it makes
any difference when drbd runs on top of md or on top of raw disk
partitions.

Other suggestions, please?

Eugene
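A minimal sketch of the kind of check the reproduction plan above could
use, assuming a separate test filesystem on the test drbd device; the
mount point and file names are illustrative, not our actual setup:

  # 1. With the peer disconnected, take a baseline checksum of every file
  #    on the mounted test filesystem.
  # cd /mnt/drbdtest && find . -type f | sort | xargs md5sum > /tmp/baseline.md5

  # 2. Bring drbd up on the peer so a full resync starts and this node
  #    becomes SyncSource, as in the logs above.

  # 3. While the resync is running, re-read everything and compare; any
  #    difference means reads returned wrong data during the sync.
  # cd /mnt/drbdtest && find . -type f | sort | xargs md5sum > /tmp/during-sync.md5
  # diff /tmp/baseline.md5 /tmp/during-sync.md5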