Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Wednesday, 19 October 2005 14:50, Eugene Crosser wrote:
> > P.S. To verify our earlier findings with more certainty, we have been
> > running the server in WFConnection state for three days. This morning,
> > it continues the same way, but we have also started two pairs of netcat
> > over the gigabit link (in opposite directions) to give some stress to
> > both the Ethernet and SATA subsystems. So far, no problem. In a day or
> > two, we'll start the DRBD sync and see what happens.
>
> OK, here is some reasonably certain information:
>
> On the 13th, we mounted a freshly checked filesystem from drbd, in
> WFConnection state, and put it into production (NFS exported). Until
> the 17th, there was network activity from NFS clients on one 100Mbit
> interface, from Legato backup on another 100Mbit interface, and no
> activity on the crossover Gbit interface.
>
> On the 17th, we started two pairs of netcat pipes in opposite directions
> over the Gbit crossover interface, copying from /dev/zero on one machine
> to /dev/null on the other. It ran this way until today (the 19th) and
> there were no problems.
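
Note: a pair of netcat pipes of the kind described above can be set up
roughly as follows. This is only a sketch; the port numbers and host
names are invented here, and the -l -p syntax assumes the traditional
netcat (the OpenBSD variant takes the listen port without -p).

  # nc -l -p 5001 > /dev/null &        (listener on node A)
  # nc nodeA-gbit 5001 < /dev/zero &   (sender on node B)
  # nc -l -p 5002 > /dev/null &        (listener on node B)
  # nc nodeB-gbit 5002 < /dev/zero &   (sender on node A)

Each pair pushes zeros over the Gbit crossover link in one direction, so
running both pairs stresses the link in both directions at once.
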
> Today, we killed the netcats and started drbd on the peer node, so it
> began to sync. The filesystem was *not* remounted, and the NFS server
> was *not* restarted.
>
> Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFConnection --> WFReportParams
> Oct 19 11:55:54 snfs1 kernel: drbd0: Handshake successful: DRBD Network
> Protocol version 74
> Oct 19 11:55:54 snfs1 kernel: drbd0: Connection established.
> Oct 19 11:55:54 snfs1 kernel: drbd0: I am(P):
> 1:00000002:00000003:00000004:00000005:10
> Oct 19 11:55:54 snfs1 kernel: drbd0: Peer(S):
> 0:00000002:00000002:00000003:00000004:01
> Oct 19 11:55:54 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFReportParams --> WFBitMapS
> Oct 19 11:55:55 snfs1 kernel: drbd0: Primary/Unknown --> Primary/Secondary
> Oct 19 11:55:55 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> WFBitMapS --> SyncSource
> Oct 19 11:55:55 snfs1 kernel: drbd0: Resync started as SyncSource (need
> to sync 1741484304 KB [435371076 bits set]).
>
> In about an hour, the traditional trouble happened:
>
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #96174334: rec_len is smaller than
> minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
> Oct 19 12:50:58 snfs1 kernel: Aborting journal on device drbd0.
> Oct 19 12:50:58 snfs1 kernel: journal commit I/O error
> Oct 19 12:50:58 snfs1 last message repeated 4 times
> Oct 19 12:50:58 snfs1 kernel: ext3_abort called.
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_journal_start_sb: Detected aborted journal
> Oct 19 12:50:58 snfs1 kernel: Remounting filesystem read-only
> Oct 19 12:50:58 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #96174334: rec_len is smaller than
> minimal - offset=0, inode=1884609311, rec_len=0, name_len=0
>
> At this moment, we stopped the NFS server, did *not* stop the drbd sync,
> unmounted the filesystem, ran
> # blockdev --flushbufs /dev/drbd0
> # blockdev --flushbufs /dev/md3
> and mounted the filesystem back:
>
> Oct 19 12:53:18 snfs1 kernel: EXT3-fs warning: mounting fs with errors,
> running e2fsck is recommended
>
> and ran "ls -lR" on the filesystem. Instantly, a similar ext3 error
> manifested itself!
>
> Oct 19 12:54:22 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #38766129: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 19 12:54:22 snfs1 kernel: Aborting journal on device drbd0.
> Oct 19 12:54:24 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #39469971: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
> Oct 19 12:54:26 snfs1 kernel: EXT3-fs error (device drbd0):
> ext3_readdir: bad entry in directory #38766116: inode out of bounds -
> offset=0, inode=1728214147, rec_len=512, name_len=16
>
> Now, we stopped drbd on the peer:
>
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_worker [1437]: cstate
> SyncSource --> NetworkFailure
> Oct 19 12:59:56 snfs1 kernel: drbd0: asender terminated
> Oct 19 12:59:56 snfs1 kernel: drbd0: worker terminated
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> NetworkFailure --> Unconnected
> Oct 19 12:59:56 snfs1 kernel: drbd0: Connection lost.
> Oct 19 12:59:56 snfs1 kernel: drbd0: drbd0_receiver [1450]: cstate
> Unconnected --> WFConnection
>
> unmounted the filesystem, and then ran fsck on /dev/drbd0. No errors
> were found. Then we started the NFS server, and it has worked fine since.
> We also ran "ls -lR" again, and it did not trigger any problems.
>
> So, what we know so far:
> The filesystem errors are in in-memory structures only, not on disk.
> They are not related to NFS (they show up on a local "ls -lR").
> They are not related to Ethernet activity per se.
> They are triggered by running drbd sync (and only that).
>
> I would suggest that reading blocks from drbd sometimes yields wrong
> data *if* that data is in the process of being synced to the peer.
>
> My further plans: try to reproduce the problem in a testing environment
> (another block device on the same hosts), and find out whether it makes
> any difference when you run drbd on top of md or on top of raw disk
> partitions. Other suggestions, please?
>
> Eugene

Hi Eugene,

I guess I will find some time on Friday to look into this issue, and I
will be happy about every bit of information that you have collected by
then.

-phil
--
: Dipl-Ing Philipp Reisner                    Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH        Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria   http://www.linbit.com :
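
Note: a minimal version of the reproduction Eugene plans above might look
roughly like the following. This is a sketch only; the resource name (r1),
the devices (/dev/drbd1 backed by a spare partition), and the mount point
are assumptions for illustration, and the exact drbdadm invocations may
differ between drbd versions.

  # drbdadm up r1                (on both nodes)
  # drbdadm primary r1           (on the test node)
  # mkfs.ext3 /dev/drbd1
  # mount /dev/drbd1 /mnt/test
    ... populate /mnt/test with a large directory tree ...
  # drbdadm disconnect r1        (let some writes accumulate unsynced)
    ... write more files ...
  # drbdadm connect r1           (resync starts)
  # ls -lR /mnt/test             (watch the kernel log for EXT3-fs
                                  errors while the resync is running)

To compare md against raw partitions, the same resource could be backed
once by an md device and once by a plain disk partition.
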