[DRBD-user] drbd replicates errors too!

Thu Apr 26 16:12:09 CEST 2007

I am struggling to come up with a solution to a problem and would
appreciate any advice.  This is not strictly a DRBD issue, but the nature
of DRBD's block-level replication leaves me in a sticky situation.
The problem is that after an I/O error of some sort on one disk, the
resulting filesystem error is replicated to the secondary node.  Are there any
recommended or standard operating procedures people use to deal with
this situation?  The only thing I can think of is to take the
original primary node, where the error occurred, offline, run an fsck on
it, then bring it back online as the primary and invalidate the
secondary so that it fully resyncs from the fsck-ed drive on the primary.
Not only does this sound a little scary from a logistical point of view,
it also sounds like I will lose data or be forced to run the cluster
read-only during the time it takes to fsck the primary.
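
To make the procedure above concrete, here is a rough dry-run sketch in
Python that only prints the steps in order and executes nothing.  The
resource name "r0" and the mount point "/data" are placeholders, not
values from this cluster, and the sketch ignores Heartbeat entirely.  The
drbdadm subcommands used (disconnect, connect, invalidate) should be
checked against the drbdadm man page for 0.7 before relying on any of this.

```python
import shlex


def build_recovery_plan(resource="r0", device="/dev/drbd0", mountpoint="/data"):
    """Return (node, command) steps for the fsck-and-resync idea.

    Nothing is executed. "r0" and "/data" are hypothetical defaults.
    """
    q = shlex.quote
    return [
        # Stop using the filesystem and cut replication on the failed primary.
        ("primary", f"umount {q(mountpoint)}"),
        ("primary", f"drbdadm disconnect {q(resource)}"),
        # Repair the filesystem while the peer is not receiving writes.
        ("primary", f"e2fsck -f {q(device)}"),
        ("primary", f"mount {q(device)} {q(mountpoint)}"),
        # Reconnect, then force a full resync of the secondary from the primary.
        ("primary", f"drbdadm connect {q(resource)}"),
        ("secondary", f"drbdadm invalidate {q(resource)}"),
    ]


if __name__ == "__main__":
    for node, command in build_recovery_plan():
        print(f"[{node}] {command}")
```

The filesystem is unavailable from the umount until the mount, which is
the outage window I am worried about.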

Unfortunately, this is a production server involved in intense
read/write activity around the clock.  I am hoping that perhaps I have
missed some simpler solution to my problem and so I am reaching out to
the drbd community.

Here is some information about the cluster:

DRBD 0.7.23 (with Heartbeat 2.0.7)
CentOS 4.4
Kernel 2.6.15.7
Dell PE 2850, PERC RAID 5 with four 146 GB HDs for data, and the meta-disk
on a separate partition on a separate RAID array.

Primary Node:
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
offset=0, inode=1431655765, rec_len=21845, name_len=85
Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
ext3_journal_start_sb: Detected aborted journal
Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
start_transaction: Journal has aborted
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
ext3_create: IO failure

Secondary Node (after the failure, when it became the primary):
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
ext3_clear_journal_err: Filesystem error recorded from previous mount:
IO failure
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
ext3_clear_journal_err: Marking fs in need of filesystem check.
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
errors, running e2fsck is recommended
Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
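
For catching this kind of event early, a minimal Python sketch that picks
EXT3-fs error and warning lines for one device out of syslog text might
look like the following.  It assumes each kernel message sits on a single
line (the excerpts above are wrapped by the mail client), and the function
name and regular expression are illustrative only.

```python
import re

# Matches kernel messages such as:
#   "... kernel: EXT3-fs error (device drbd0): ..."
#   "... kernel: EXT3-fs warning: mounting fs with errors, ..."
EXT3_RE = re.compile(
    r"kernel: EXT3-fs (?P<level>error|warning)"
    r"(?: \(device (?P<device>[^)]+)\))?"
)


def scan_ext3_events(lines, device="drbd0"):
    """Return (level, line) pairs for EXT3-fs errors/warnings on `device`.

    Messages that name no device are kept; messages naming a different
    device are skipped.
    """
    events = []
    for line in lines:
        match = EXT3_RE.search(line)
        if match is None:
            continue
        named = match.group("device")
        if named is not None and named != device:
            continue
        events.append((match.group("level"), line.strip()))
    return events


if __name__ == "__main__":
    sample = [
        "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): "
        "ext3_journal_start_sb: Detected aborted journal",
        "Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only",
        "Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with "
        "errors, running e2fsck is recommended",
        "Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.",
    ]
    for level, text in scan_ext3_events(sample):
        print(level, "|", text)
```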

-- 
Charles Bennington
Oddcast, Inc.

direct: (646) 378-4327
main:   (212) 375-6290 x327
