[DRBD-user] drbd replicates errors too!

Thu Apr 26 16:16:48 CEST 2007

On Thu, Apr 26, 2007 at 10:12:09AM -0400, Charles Bennington wrote:
> I am struggling to come up with a solution to a problem and would
> appreciate any advice.  This is not strictly a DRBD issue but the nature
> of drbd's block replicated filesystem leaves me in a sticky situation. 
> The problem is that after suffering an IO error of some sort on one disk
> the error is replicated to the secondary node.  Are there any
> recommended or standard operating procedures people use to deal with
> this situation?  The only thing I can think to do is to take the
> original primary node where the error occurred off line, run an fsck on
> it and then bring it back online as the master and invalidate the
> secondary so that it fully replicates the FSCK-ed drive on the primary. 
> Not only does this sound a little scary from a logistical point of view,
> it also sounds like I will lose data or be forced to run the cluster
> read-only during the time it takes to fsck the primary.
> 
> Unfortunately, this is a production server involved in intense
> read/write activity around the clock.  I am hoping that perhaps I have
> missed some simpler solution to my problem and so I am reaching out to
> the drbd community.
> 
> Here is some information about the cluster:
> 
> DRBD 0.7.23 (with Heartbeat 2.0.7)
> CentOS 4.4
> Kernel 2.6.15.7
> Dell PE 2850, Perc RAID 5 with 4 146GB HDs for data and the meta-disk on
> a separate partition on a separate RAID array.
> 
> Primary Node:
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
> ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
> offset=0, inode=1431655765, rec_len=21845, name_len=85
> Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
> Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
> ext3_journal_start_sb: Detected aborted journal
> Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
> start_transaction: Journal has aborted
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
> ext3_create: IO failure
> 
> Secondary Node (After the failure when it becomes the primary):
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
> ext3_clear_journal_err: Filesystem error recorded from previous mount:
> IO failure
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
> ext3_clear_journal_err: Marking fs in need of filesystem check.
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
> errors, running e2fsck is recommended
> Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
> ordered data mode.

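These log entries don't indicate I/O errors, only data integrity errors:
ext3 found a corrupt directory entry and aborted its journal. DRBD
replicates at the block level, so the peer holds the same damaged
filesystem, and a repair written through the DRBD device is replicated
the same way. In this case, you should be able to unmount the filesystem
on the primary, fsck /dev/drbd0 there, and have the fix replicate to the
slave.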
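
To make the distinction concrete, here is a rough sketch (not a DRBD or
ext3 tool) that sorts kernel log lines into block-layer I/O errors and
filesystem-level errors. The pattern lists and the names classify() and
summarize() are my own assumptions for illustration and are far from
exhaustive. It expects each log message on a single line, not wrapped
as in the mail above.

```python
import re

# Assumed, non-exhaustive patterns for block-layer failures, i.e. the
# disk or controller failed a request (typical 2.6-era kernel messages).
BLOCK_IO_PATTERNS = [
    re.compile(r"end_request: I/O error"),
    re.compile(r"Buffer I/O error on device"),
]

# Assumed patterns for messages emitted by ext3 itself when it finds
# inconsistent on-disk structures or reacts to an aborted journal.
FS_PATTERNS = [
    re.compile(r"EXT3-fs (error|warning)"),
    re.compile(r"Aborting journal on device"),
]


def classify(line):
    """Return 'block-io', 'filesystem', or 'other' for one log line."""
    if any(p.search(line) for p in BLOCK_IO_PATTERNS):
        return "block-io"
    if any(p.search(line) for p in FS_PATTERNS):
        return "filesystem"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = {"block-io": 0, "filesystem": 0, "other": 0}
    for line in lines:
        counts[classify(line)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "kernel: EXT3-fs error (device drbd0): ext3_add_entry: "
        "bad entry in directory #84378594: rec_len % 4 != 0",
        "kernel: Aborting journal on device drbd0.",
        "kernel: Remounting filesystem read-only",
    ]
    print(summarize(sample))
```

If a log shows only "filesystem" hits and no "block-io" hits, as in the
excerpt quoted above, the disks themselves are probably not reporting
failures and an fsck through the DRBD device is the place to start.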

-- 
lfr
0/0