Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Thanks for the reply. Part of the issue was that I wasn't aware of the right way to attach drbd0, but not MOUNT it, outside of the heartbeat/etc. scripts. Once I figured that out, I was able to run xfs_repair on it. It had a lot of little issues to fix; I assume DRBD worked so well that it kept faithfully replicating the corrupted filesystem.

When I said "reoccurring" I actually meant that every time the device got mounted, it would crash a minute or two later - including just tonight. Apparently the corruption finally hit a point where the filesystem had to be repaired or could no longer function at all. I've upgraded the kernel to the latest (so I would have all the NFS, XFS, etc. fixes, just in case) and compiled the latest 0.7-branch DRBD module for it. That didn't help; it was purely the XFS corruption, it seems. Now I have it running again and it seems to be fine.

One thing I found on a mailing list that was useful (of course I can't find the URL now) concerned an ioctl error when the device could not be attached: it seems that even though /etc/drbd.conf had the definitions, you have to run /etc/init.d/drbd stop and start to get it to recreate some index or state information about the device.

Anyway, for now all is well; I wrote that earlier email in the midst of chaos, thinking I might not be able to fix it. Thanks.

- mike

On 9/4/06, Lars Ellenberg <Lars.Ellenberg at linbit.com> wrote:
> / 2006-09-03 19:37:07 -0700
> \ mike:
> > This keeps reoccurring - and for some reason, xfs_repair and xfs_check
> > can never be run, because /dev/drbd0 is always marked as mounted and
> > writeable....
>
> we have several deployments of storage clusters with xfs
> up into the TB range, and did not see any problem.
>
> if you have a primary drbd0, and you did not mount it, you should be
> able to run xfs_repair on it.
>
> you say "reoccurring". when, in what circumstances, how frequently etc.
> if you ignore drbd for the moment while searching the web, you'll
> find similar reports, where the problem disappeared by replacing bad
> ram or bad disks, or by switching off the (not-battery-backed) write
> cache of the storage subsystem.
>
> one more wild guess: does the drbd module fit the kernel exactly?
> we recently had issues with a drbd module being compiled for a slightly
> different kernel, apparently loadable without complaints, but binary
> incompatible with the kernel it was loaded into. it should not be
> possible at all to load a binary incompatible module, but ...
>
> > CentOS 4.2 (maybe 4.3)
> > kernel 2.6.9-22.0.2.ELsmp
> > arch: x86_64
> > drbdadm version:
> > Version: 0.7.17 (api:77)
> > SVN Revision: 2093 build by buildcentos at v20z-x86-64, 2006-04-13 15:11:10
> >
> > XFS mounting filesystem drbd0
> > Starting XFS recovery on filesystem: drbd0 (dev: drbd0)
> > XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1583 of file
> > /home/buildcentos/rpmbuild/BUILD/xfs/xfs_alloc.c. Caller
> > 0xffffffffa0007ed1
> >
> > Call Trace:<ffffffffa00062b6>{:xfs:xfs_free_ag_extent+389}
> ...
> > Ending XFS recovery on filesystem: drbd0 (dev: drbd0)
> >
> > Anyone have any tidbits of wisdom here? It's a production system, and
> > I've been trying to stay in line with CentOS kernel updates + DRBD
> > updates... not run my own stuff. But this is beginning to drive me
> > nuts....
>
> --
> : Lars Ellenberg                        Tel +43-1-8178292-0  :
> : LINBIT Information Technologies GmbH  Fax +43-1-8178292-82 :
> : Schoenbrunner Str. 244, A-1120 Vienna/Europe  http://www.linbit.com :
> __
> please use the "List-Reply" function of your email client.
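P.S. For the archives, roughly the sequence that ended up working for me, from memory, so treat it as a sketch and double-check it against your own setup; "r0" is just a placeholder for whatever resource name your /etc/drbd.conf defines, and the mount point is made up:

    # keep heartbeat from grabbing and mounting the resource while repairing
    /etc/init.d/heartbeat stop

    # restart drbd so it re-reads /etc/drbd.conf and can attach the device again
    /etc/init.d/drbd stop
    /etc/init.d/drbd start

    # make this node primary, but do NOT mount the filesystem
    drbdadm primary r0

    # run the repair directly against the unmounted drbd device
    xfs_repair /dev/drbd0

    # only after it comes back clean, mount it (or let heartbeat do it)
    mount /dev/drbd0 /mnt/data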