[DRBD-user] Re: filesystem corruptions

Eugene Crosser crosser at rol.ru
Mon Nov 14 10:31:17 CET 2005


Eugene Crosser wrote:

> I have preliminary information about a similar filesystem corruption in
> a completely different environment: drbd on raw SCSI(!) partition,
> without md.  I did not have a chance to investigate yet, even reproduce.
>  I'll try it these days.  It's non-production system, and will be much
> easier to experiment with.  (SMP, vanilla 2.6.14 kernel, drbd 0.7.14).

Gentlemen,
I have *some* findings now, but nothing definite yet.

Over the last few days I had at my disposal two Xeon servers running
an SMP/Hyperthreading 2.6.14 kernel (vanilla from kernel.org) and drbd
0.7.14 (SVN 1989).  There is no hardware RAID or md, just a single SCSI
disk in each box, on an AIC-7902 U320 controller, and two e1000 NICs.
After the DRBD device was first created and some filesystem activity
started, I got the familiar ext3-fs errors.  But after I did
"invalidate all" on the secondary a few times, the errors stopped
occurring, and after a couple of node switchovers I cannot reproduce
the problem anymore.

During the time when there were reproducible fs errors on the drbd
device, I also got one similar filesystem error on a *non*-drbd
partition!  At first I said to myself: "oh, then it's hardware this
time."  But it does not happen anymore, now that there are no errors on
the drbd device either.  So maybe it's not.  Maybe it's corruption of
some kernel structures?

One other thing that I noticed, just once, and did not have a chance
to reproduce: when the master got an fs error, I unmounted the
filesystem and waited for drbd to complete synchronization.  Then fsck
on the master did not report any errors.  But when I disconnected the
secondary, made it primary and ran fsck, there *were* errors there.
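
A direct way to check whether the two replicas really differ on disk,
rather than inferring it from fsck, would be to checksum both backing
devices chunk by chunk and compare the results.  The following is only
a minimal sketch in Python, not something that was run on the systems
described here.  The function names and the 1 MiB chunk size are
arbitrary choices of the sketch, and it assumes the filesystem is
unmounted and the sync has finished, so that neither device changes
while it is being read.

```python
import hashlib

CHUNK = 1 << 20  # compare in 1 MiB chunks (arbitrary choice for this sketch)


def chunk_digests(path, chunk=CHUNK):
    """Yield (offset, md5 hexdigest) for each chunk of a file or block device."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(chunk)
            if not data:
                break
            yield offset, hashlib.md5(data).hexdigest()
            offset += len(data)


def differing_offsets(digests_a, digests_b):
    """Return sorted offsets whose digests differ or exist on one side only."""
    a, b = dict(digests_a), dict(digests_b)
    return sorted(off for off in a.keys() | b.keys() if a.get(off) != b.get(off))
```

Each node would run chunk_digests() against its own lower-level
partition and save the output.  The two lists can then be fed to
differing_offsets() on either machine, and any offsets it returns
point at the regions where primary and secondary disagree.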

Now I vaguely recall that I *may* have seen some similar problems on my
currently-stable systems immediately after their setup, but I tested
them extensively, and the problems did not reproduce.  So, my current
theory is that the "virtual corruption" problem only happens on freshly
created drbd devices, and goes away after a sync, or several syncs.

I currently have a couple of spare Dells, similar to those I have
running reliably in production.  I'll try to verify my findings on
them, and then report back.

Eugene