Eugene Crosser wrote:
> I have preliminary information about a similar filesystem corruption in
> a completely different environment: drbd on raw SCSI(!) partition,
> without md. I did not have a chance to investigate yet, even reproduce.
> I'll try it these days. It's non-production system, and will be much
> easier to experiment with. (SMP, vanilla 2.6.14 kernel, drbd 0.7.14).

Gentlemen,

I can come up with *some* findings now, but no definite information yet.

These days I had at my disposal two Xeon servers running an SMP/Hyperthreading 2.6.14 kernel (vanilla from kernel.org) and drbd 0.7.14 (SVN 1989). There is no hardware RAID or md, just a single SCSI disk in each box on an AIC-7902 U320 controller, and two e1000 NICs.

After the DRBD device was first created and some filesystem activity was started, I got the familiar ext3-fs errors. But after I did "invalidate all" on the secondary a few times, the errors stopped occurring, and after a couple of node switchovers I cannot reproduce the problem anymore.

During the time when there were reproducible fs errors on the drbd device, I also got one similar filesystem error on a *non*-drbd partition! At first I said to myself: "oh, then it's hardware this time." But it does not happen anymore, now that there are no errors on the drbd device either. So maybe it's not hardware. Maybe it's corruption of some kernel structures?

One other thing that I noticed, just once, and did not have a chance to reproduce: when the master got an fs error, I unmounted the filesystem and waited for drbd to complete synchronization. Then fsck on the master did not find any errors. But when I disconnected the secondary, made it primary and ran fsck, there *were* errors there.

Now I vaguely recall that I *may* have seen some similar problems on my currently-stable systems immediately after their setup, but I tested them extensively, and the problems did not reproduce.

So, my current theory is that the "virtual corruption" problem only happens on freshly created drbd devices, and goes away after a sync, or several syncs.

I currently have a couple of spare Dells, similar to those I have running reliably in production. I'll try to verify my findings on them, and then be back.

Eugene
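
One way to test the observation that fsck passed on the master but failed on the former secondary, without relying on fsck alone, would be to compare the two backing devices block by block after the filesystem is unmounted and the sync has finished. The following is only a minimal Python sketch under stated assumptions: the names `block_digests` and `differing_blocks`, the 1 MiB default block size, and the device path in the usage note are illustrative and not part of DRBD. If DRBD keeps its meta-data on the same backing device, that region may legitimately differ between nodes.

```python
import hashlib


def block_digests(path, block_size=1024 * 1024):
    """Return one SHA-1 hex digest per fixed-size block of a file or device.

    Hypothetical helper: run it on each node against the unmounted,
    fully synced backing device (or an image of it).
    """
    digests = []
    with open(path, "rb") as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:
                break
            digests.append(hashlib.sha1(chunk).hexdigest())
    return digests


def differing_blocks(digests_a, digests_b):
    """Return the indices of blocks whose digests differ.

    A block present on only one side (devices of unequal size)
    is also reported as differing.
    """
    length = max(len(digests_a), len(digests_b))
    diffs = []
    for i in range(length):
        a = digests_a[i] if i < len(digests_a) else None
        b = digests_b[i] if i < len(digests_b) else None
        if a != b:
            diffs.append(i)
    return diffs
```

In use, each node would produce its own list, for example `block_digests("/dev/sdXN")` with a placeholder device name, and the two lists would then be brought to one machine and passed to `differing_blocks`. An empty result would suggest the replicas hold the same data, so an fsck difference would have to come from somewhere else; a non-empty result gives block indices to look at.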