<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
</head>
<body bgcolor="#ffffff" text="#330033">
<font size="-1"><font face="Helvetica, Arial, sans-serif">I am
struggling to come up with a solution to a problem and would appreciate
any advice. This is not strictly a DRBD issue, but the nature of DRBD's
block-level replication leaves me in a sticky situation. The
problem is that after an I/O error of some sort on one disk,
the error is replicated to the secondary node. Are there any
recommended or standard operating procedures people use to deal with
this situation? The only thing I can think of is to take the
original primary node where the error occurred offline, run fsck on
it, bring it back online as the primary, and then invalidate the
secondary so that it fully resynchronizes from the fsck'ed device on the
primary. Not only does this sound a little scary from a logistical
point of view, it also sounds like I will either lose data or be forced to run
the cluster read-only for the time it takes to fsck the primary.<br>
<br>
Unfortunately, this is a production server under heavy
read/write load around the clock. I am hoping that I have
missed some simpler solution to my problem, so I am reaching out to
the DRBD community.<br>
<br>
Here is some information about the cluster:<br>
<br>
DRBD 0.7.23 (with Heartbeat 2.0.7)<br>
CentOS 4.4<br>
Kernel 2.6.15.7<br>
Dell PowerEdge 2850, PERC RAID 5 with four 146 GB disks for data; the meta-disk
is on a separate partition on a separate RAID array.<br>
<br>
Primary Node:<br>
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
offset=0, inode=1431655765, rec_len=21845, name_len=85<br>
Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.<br>
Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.<br>
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
ext3_journal_start_sb: Detected aborted journal<br>
Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only<br>
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
start_transaction: Journal has aborted<br>
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
ext3_create: IO failure<br>
<br>
Secondary Node (after the failure, once it has become the primary):<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
ext3_clear_journal_err: Filesystem error recorded from previous mount:
IO failure<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
ext3_clear_journal_err: Marking fs in need of filesystem check.<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
errors, running e2fsck is recommended<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.<br>
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
ordered data mode.<br>
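<br>
For anyone wanting to watch for these messages automatically, here is a minimal Python sketch, assuming standard single-line syslog entries in the format shown above. The names EXT3_LINE and scan_ext3_events are made up for illustration and are not part of DRBD or Heartbeat; the sketch only reports which host and device logged an EXT3-fs error or warning and does nothing to repair it.<br>
<pre>
```python
import re

# Matches kernel lines such as:
#   "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): ..."
# Group 1 is the host, group 2 the severity, group 3 the device name.
# Assumption: each syslog entry is a single line in the classic BSD format.
EXT3_LINE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+"
    r"EXT3-fs\s+(error|warning)\s+\(device\s+(\w+)\)"
)


def scan_ext3_events(lines):
    """Return (host, severity, device) tuples for EXT3-fs error/warning lines."""
    events = []
    for line in lines:
        match = EXT3_LINE.match(line)
        if match:
            events.append(match.groups())
    return events


if __name__ == "__main__":
    # Sample lines abbreviated from the log excerpts in this message.
    sample = [
        "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): "
        "ext3_add_entry: bad entry in directory #84378594",
        "Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.",
        "Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0): "
        "ext3_clear_journal_err: Marking fs in need of filesystem check.",
    ]
    for host, severity, device in scan_ext3_events(sample):
        print(host, severity, device)
```
</pre>
Lines that do not name a device, such as "Aborting journal on device drbd0." or the "mounting fs with errors" warning, are deliberately skipped by this pattern; it would need widening to catch them.<br>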
<br>
<br>
</font></font>
<pre class="moz-signature" cols="72">--
Charles Bennington
Oddcast, Inc.
direct: (646) 378-4327
main: (212) 375-6290 x327
</pre>
</body>
</html>