Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Thu, Apr 26, 2007 at 10:12:09AM -0400, Charles Bennington wrote:
> I am struggling to come up with a solution to a problem and would
> appreciate any advice. This is not strictly a DRBD issue but the nature
> of drbd's block replicated filesystem leaves me in a sticky situation.
> The problem is that after suffering an IO error of some sort on one disk
> the error is replicated to the secondary node. Are there any
> recommended or standard operating procedures people use to deal with
> this situation? The only thing I can think to do is to take the
> original primary node where the error occurred off line, run an fsck on
> it and then bring it back online as the master and invalidate the
> secondary so that it fully replicates the FSCK-ed drive on the primary.
> Not only does this sound a little scary from a logistical point of view,
> it also sounds like I will lose data or be forced to run the cluster
> read-only during the time it takes to fsck the primary.
>
> Unfortunately, this is a production server involved in intense
> read/write activity around the clock. I am hoping that perhaps I have
> missed some simpler solution to my problem and so I am reaching out to
> the drbd community.
>
> Here is some information about the cluster:
>
> DRBD 0.7.23 (with Heartbeat 2.0.7)
> CentOS 4.4
> Kernel 2.6.15.7
> Dell PE 2850, Perc RAID 5 with 4 146GB HDs for data and the meta-disk on
> a separate partition on a separate RAID array.
>
> Primary Node:
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
> ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
> offset=0, inode=1431655765, rec_len=21845, name_len=85
> Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
> Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
> ext3_journal_start_sb: Detected aborted journal
> Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
> start_transaction: Journal has aborted
> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
> ext3_create: IO failure
>
> Secondary Node (After the failure when it becomes the primary):
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
> ext3_clear_journal_err: Filesystem error recorded from previous mount:
> IO failure
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
> ext3_clear_journal_err: Marking fs in need of filesystem check.
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
> errors, running e2fsck is recommended
> Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
> ordered data mode.

These don't indicate I/O errors, only data integrity errors.

In this case, you should be able to fsck /dev/drbd0 and have the fix
replicate to the slave.

-- 
lfr
0/0
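
To illustrate the distinction drawn in the reply above (filesystem-level
integrity errors versus real block-device I/O errors), here is a minimal
Python sketch that sorts kernel log lines into the two groups. The pattern
lists and the classify() helper are assumptions made for illustration; they
are not part of DRBD or e2fsprogs, and the block-layer patterns cover only
two message forms commonly logged by 2.6-era kernels.

```python
import re

# Messages the 2.6-era block layer commonly logs on a real device I/O error.
# Assumption: illustrative only, not an exhaustive catalogue.
BLOCK_IO_PATTERNS = (
    re.compile(r"end_request: I/O error"),
    re.compile(r"Buffer I/O error on device"),
)

# Filesystem-level complaints from ext3, as seen in the quoted logs.
FS_PATTERN = re.compile(r"EXT3-fs (error|warning)")


def classify(lines):
    """Count block-layer I/O errors and ext3-level messages in log lines."""
    counts = {"block_io": 0, "filesystem": 0}
    for line in lines:
        if any(p.search(line) for p in BLOCK_IO_PATTERNS):
            counts["block_io"] += 1
        elif FS_PATTERN.search(line):
            counts["filesystem"] += 1
    return counts


# Three lines taken from the primary node's log in the quoted message
# (wrapped lines joined back together).
SAMPLE = [
    "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): "
    "ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 - "
    "offset=0, inode=1431655765, rec_len=21845, name_len=85",
    "Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.",
    "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in "
    "ext3_create: IO failure",
]

if __name__ == "__main__":
    print(classify(SAMPLE))
```

On the sample lines this counts only filesystem-level messages. The
"IO failure" text in the ext3_create line is ext3 reporting an error code
after the journal abort, so on its own it does not show that the underlying
disk failed a request.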