I am struggling to come up with a solution to a problem and would appreciate any advice. This is not strictly a DRBD issue, but the nature of DRBD's block-level replication leaves me in a sticky situation. The problem is that after one disk suffers an IO error of some sort, the error is replicated to the secondary node.

Are there any recommended or standard operating procedures people use to deal with this situation? The only thing I can think to do is to take the original primary node, where the error occurred, offline, run an fsck on it, then bring it back online as the master and invalidate the secondary so that it fully resyncs from the fsck'ed drive on the primary. Not only does this sound a little scary from a logistical point of view, it also sounds like I will lose data or be forced to run the cluster read-only during the time it takes to fsck the primary. Unfortunately, this is a production server under intense read/write activity around the clock.

I am hoping that I have missed some simpler solution to my problem, so I am reaching out to the DRBD community.

Here is some information about the cluster:

DRBD 0.7.23 (with Heartbeat 2.0.7)
CentOS 4.4
Kernel 2.6.15.7
Dell PE 2850, Perc RAID 5 with 4 146GB HDs for data, and the meta-disk on a separate partition on a separate RAID array.

Primary Node:

Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 - offset=0, inode=1431655765, rec_len=21845, name_len=85
Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): ext3_journal_start_sb: Detected aborted journal
Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in start_transaction: Journal has aborted
Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in ext3_create: IO failure

Secondary Node (after the failure, when it becomes the primary):

Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0): ext3_clear_journal_err: Filesystem error recorded from previous mount: IO failure
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0): ext3_clear_journal_err: Marking fs in need of filesystem check.
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with errors, running e2fsck is recommended
Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with ordered data mode.

-- 
Charles Bennington
Oddcast, Inc.
direct: (646) 378-4327
main: (212) 375-6290 x327
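
As a rough aid for reading kernel logs like the ones quoted above, the following Python sketch picks out the first EXT3-fs error or warning reported against a drbd device on each host, which can help confirm on which node the filesystem error first appeared. The regular expression and the function name `first_ext3_events` are assumptions made for illustration; the pattern only handles the syslog layout shown in this message and is not a general log parser.

```python
import re

# Matches syslog lines of the form quoted in this message, e.g.
# "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): ..."
# (Assumed layout: timestamp, host, "kernel:", then the EXT3-fs message.)
EXT3_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"EXT3-fs\s+(?P<level>error|warning)\s+\(device\s+(?P<dev>drbd\d+)\)"
)


def first_ext3_events(lines):
    """Return {host: (timestamp, level, device)} for the first
    EXT3-fs error/warning on a drbd device seen for each host."""
    first = {}
    for line in lines:
        m = EXT3_RE.match(line)
        if m and m.group("host") not in first:
            first[m.group("host")] = (
                m.group("stamp"),
                m.group("level"),
                m.group("dev"),
            )
    return first


if __name__ == "__main__":
    sample = [
        "Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0): "
        "ext3_journal_start_sb: Detected aborted journal",
        "Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0): "
        "ext3_clear_journal_err: Marking fs in need of filesystem check.",
    ]
    for host, event in first_ext3_events(sample).items():
        print(host, event)
```

Lines that do not name a drbd device, such as "Aborting journal on device drbd0." or "EXT3-fs: recovery complete.", are deliberately ignored so that only filesystem-level errors and warnings are reported.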