Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks for the reply. Part of the issue was that I wasn't aware of the right way to attach drbd0, but not MOUNT it, outside of the heartbeat/etc. scripts. Once I figured that out, I was able to run xfs_repair on it. It had a lot of little issues to fix; I assume DRBD worked so well that it kept faithfully replicating the corrupted filesystem.

When I said "reoccurring" I actually meant that every time the device got mounted, it would crash a minute or two later - including just tonight. Apparently the corruption finally hit a point where the filesystem had to be repaired or could no longer function at all. I've upgraded the kernel to the latest (so I would have all the NFS, XFS, etc. fixes, just in case) and compiled the latest 0.7-branch DRBD module for it. That didn't help; it was purely the XFS corruption, it seems. Now I have it running again and it seems to be fine.

One thing I found on a mailing list that was useful (of course I can't find the URL now) concerned an ioctl error when the device could not be attached: it seems that even though /etc/drbd.conf had the definitions, you have to run /etc/init.d/drbd stop and start to get it to recreate some index or state information about the device.

Anyway, for now all is well; I wrote that earlier email in the midst of chaos, thinking I might not be able to fix it. Thanks.

- mike

On 9/4/06, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
> / 2006-09-03 19:37:07 -0700
> \ mike:
> > This keeps reoccurring - and for some reason, xfs_repair and xfs_check
> > can never be run, because /dev/drbd0 is always marked as mounted and
> > writeable....
>
> we have several deployments of storage clusters with xfs
> up into the TB range, and did not see any problem.
>
> if you have a primary drbd0, and you did not mount it, you should be
> able to run xfs_repair on it.
>
> you say "reoccurring". when, in what circumstances, how frequently etc.
> if you ignore drbd for the moment while searching the web, you'll
> find similar reports, where the problem disappeared by replacing bad
> ram or bad disks, or by switching off the (not-battery-backed) write
> cache of the storage subsystem.
>
> one more wild guess: does the drbd module fit the kernel exactly?
> we recently had issues with a drbd module being compiled for a slightly
> different kernel, apparently loadable without complaints, but binary
> incompatible with the kernel it was loaded into. it should not be
> possible at all to load a binary incompatible module, but ...
>
> > CentOS 4.2 (maybe 4.3)
> > kernel 2.6.9-22.0.2.ELsmp
> > arch: x86_64
> > drbdadm version:
> > Version: 0.7.17 (api:77)
> > SVN Revision: 2093 build by buildcentos at v20z-x86-64, 2006-04-13 15:11:10
> >
> > XFS mounting filesystem drbd0
> > Starting XFS recovery on filesystem: drbd0 (dev: drbd0)
> > XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1583 of file
> > /home/buildcentos/rpmbuild/BUILD/xfs/xfs_alloc.c. Caller
> > 0xffffffffa0007ed1
> >
> > Call Trace:<ffffffffa00062b6>{:xfs:xfs_free_ag_extent+389}
> ...
> > Ending XFS recovery on filesystem: drbd0 (dev: drbd0)
> >
> > Anyone have any tidbits of wisdom here? It's a production system, and
> > I've been trying to stay in line with CentOS kernel updates + DRBD
> > updates... not run my own stuff. But this is beginning to drive me
> > nuts....
>
> --
> : Lars Ellenberg                        Tel +43-1-8178292-0  :
> : LINBIT Information Technologies GmbH  Fax +43-1-8178292-82 :
> : Schoenbrunner Str. 244, A-1120 Vienna/Europe  http://www.linbit.com :
> __
> please use the "List-Reply" function of your email client.
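P.S. For the archives, roughly the sequence that ended up working for me, from memory, so treat it as a sketch and double-check it against your own setup; "r0" is just a placeholder for whatever resource name your /etc/drbd.conf defines, and the mount point is made up:

    # keep heartbeat from grabbing and mounting the resource while repairing
    /etc/init.d/heartbeat stop

    # restart drbd so it re-reads /etc/drbd.conf and can attach the device again
    /etc/init.d/drbd stop
    /etc/init.d/drbd start

    # make this node primary, but do NOT mount the filesystem
    drbdadm primary r0

    # run the repair directly against the unmounted drbd device
    xfs_repair /dev/drbd0

    # only after it comes back clean, mount it (or let heartbeat do it)
    mount /dev/drbd0 /mnt/data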