[DRBD-user] Recommended procedure to fsck a drbd device

Wed Apr 14 12:30:47 CEST 2010

On Tue, Apr 13, 2010 at 01:51:20PM +0200, Jose Manuel Torralba wrote:
> Hi, this is my first post to this list, and the question concerns
> an inherited system that I'm not completely familiar with. If I'm
> missing any necessary details, please ask.
> 
> We have a cluster running drbd 8.2.6; both nodes were affected by a power cut.

Any volatile caches enabled?

Like, 500 MB of non-battery-backed controller cache?
or 7 * 32 MB of volatile on-disk caches?

Anything that was in volatile caches when the power was cut is lost.
I bet no file system would be particularly happy about that.
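
As a rough way to answer the on-disk cache question, the sketch below asks
each drive whether its write cache is enabled. It is only an illustration:
the device names in DISKS are placeholders, it assumes ATA/SATA disks that
hdparm can query directly, and it says nothing about a RAID controller's
own cache, which needs the vendor's tool. By default it only prints what
it would run.

```shell
#!/bin/sh
# Sketch: report the on-disk write cache state of the backing disks.
# ASSUMPTIONS: the device names below are placeholders; replace them with
# the real disks behind the DRBD backing device.
DISKS="${DISKS:-/dev/sda /dev/sdb}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

check_write_cache() {
    for disk in $DISKS; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "+ hdparm -I $disk | grep -i 'write cache'"
        else
            # In the "Commands/features" list of hdparm -I, a leading '*'
            # marks a feature that is currently enabled.
            hdparm -I "$disk" | grep -i 'write cache'
        fi
    done
}

check_write_cache
```

A "Write cache" line marked with '*' means writes may have been sitting in
volatile drive memory when the power went away.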

> OS is CentOS 5.1 64bit, kernel 2.6.18-53.1.19.el5.028stab053.14
>
> The problem is that it's not restarting because neither node can
> mount the drbd device; both fail with the following error:
> 
> Apr 13 08:34:25 nodo02 Filesystem[29553]: INFO: Running start for /dev/drbd0 on /vz
> Apr 13 08:34:25 nodo02 kernel: EXT3-fs error (device drbd0): ext3_check_descriptors: Block bitmap for group 5376 not in group (block 662340004)!
> Apr 13 08:34:25 nodo02 kernel: EXT3-fs: group descriptors corrupted!
> Apr 13 08:34:25 nodo02 Filesystem[29553]: ERROR: Couldn't mount filesystem /dev/drbd0 on /vz
> 
> We have stopped services in both nodes and, in order to isolate the
> problem, we have executed in one of the nodes (both behave the same):
> 
> service drbd start
> drbdadm primary r0
> mount /dev/drbd0 /vz
> 
> Last lines of dmesg:
> 
> drbd: initialised. Version: 8.0.7 (api:86/proto:86)
> drbd: GIT-hash: cf14288833afe95db396075f8530a5960d29e498 build by phil at mescal, 2007-11-02 13:15:41
> drbd: registered as block device major 147
> drbd: minor_table @ 0xffff8102245d2ec0
> drbd0: disk( Diskless -> Attaching )
> drbd0: Found 6 transactions (324 active extents) in activity log.
> drbd0: max_segment_size ( = BIO size ) = 32768
> drbd0: drbd_bm_resize called with capacity == 2929554040
> drbd0: resync bitmap: bits=366194255 words=5721786
> drbd0: size = 1396 GB (1464777020 KB)
> drbd0: reading of bitmap took 619 jiffies
> drbd0: recounting of set bits took additional 45 jiffies
> drbd0: 4 KB marked out-of-sync by on disk bit-map.
> drbd0: disk( Attaching -> UpToDate )
> drbd0: Writing meta data super block now.
> drbd0: conn( StandAlone -> Unconnected )
> drbd0: receiver (re)started
> drbd0: conn( Unconnected -> WFConnection )
> drbd0: role( Secondary -> Primary )
> drbd0: Writing meta data super block now.
> EXT3-fs error (device drbd0): ext3_check_descriptors: Block bitmap for group 5376 not in group (block 662340004)!
> EXT3-fs: group descriptors corrupted!
> 
> We think the next step is to run fsck on one node and then force the other node to synchronize.
> 
> A read-only run of fsck.ext3 over /dev/drbd0 returns an awful lot of errors.
> 
> The question is whether this is the recommended procedure to fsck a drbd device.

If you live on LVM, you could snapshot before fsck, just in case.
Though a snapshot won't speed up the fsck, of course.
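
A minimal sketch of the snapshot-then-fsck idea, assuming the DRBD backing
device is an LVM logical volume. The volume group "vg0", the logical volume
"drbd_backing", the snapshot name and the 20G snapshot size are all made-up
placeholders; only /dev/drbd0 comes from the logs above. By default it just
prints the commands.

```shell
#!/bin/sh
# Sketch: snapshot the LVM backing volume, then fsck the DRBD device.
# ASSUMPTIONS: VG, LV, SNAP and SNAP_SIZE are hypothetical placeholders.
VG="${VG:-vg0}"
LV="${LV:-drbd_backing}"
SNAP="${SNAP:-${LV}_prefsck}"
SNAP_SIZE="${SNAP_SIZE:-20G}"
DRBD_DEV="${DRBD_DEV:-/dev/drbd0}"
# DRY_RUN=1 (default) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

snapshot_then_fsck() {
    # Keep a copy-on-write image of the pre-fsck state. Stop if the
    # snapshot cannot be created, so fsck never runs without it.
    run lvcreate --snapshot --size "$SNAP_SIZE" --name "$SNAP" "/dev/$VG/$LV" || return 1
    # Repair through the DRBD device (unmounted, this node Primary), so
    # that DRBD sees the blocks fsck rewrites.
    run fsck.ext3 -f "$DRBD_DEV"
}

snapshot_then_fsck
```

The snapshot has to be large enough to hold every block fsck rewrites;
if it fills up, it becomes invalid and the safety net is gone.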

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed