Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Just running fsck on /dev/drbd0 will require bringing the cluster offline
so that I can umount the filesystem -- which isn't really an option (a rough
sketch of what that would involve is below, after the quoted logs). So it
sounds like I really am stuck with taking one node offline to perform the
fsck, switching over to it so that it becomes master, and then re-syncing
the other node?

Luciano Miguel Ferreira Rocha wrote:
> These don't indicate I/O errors, only data integrity errors. In this
> case, you should be able to fsck /dev/drbd0 and have the fix replicate
> to the slave.
>
> On Thu, Apr 26, 2007 at 10:12:09AM -0400, Charles Bennington wrote:
>
>> I am struggling to come up with a solution to a problem and would
>> appreciate any advice. This is not strictly a DRBD issue, but the nature
>> of drbd's block-replicated filesystem leaves me in a sticky situation.
>> The problem is that after suffering an IO error of some sort on one disk,
>> the error is replicated to the secondary node. Are there any
>> recommended or standard operating procedures people use to deal with
>> this situation? The only thing I can think to do is to take the
>> original primary node where the error occurred offline, run an fsck on
>> it, and then bring it back online as the master and invalidate the
>> secondary so that it fully replicates the fsck-ed drive on the primary.
>> Not only does this sound a little scary from a logistical point of view,
>> it also sounds like I will lose data or be forced to run the cluster
>> read-only during the time it takes to fsck the primary.
>>
>> Unfortunately, this is a production server involved in intense
>> read/write activity around the clock. I am hoping that perhaps I have
>> missed some simpler solution to my problem, and so I am reaching out to
>> the drbd community.
>>
>> Here is some information about the cluster:
>>
>> DRBD 0.7.23 (with Heartbeat 2.0.7)
>> CentOS 4.4
>> Kernel 2.6.15.7
>> Dell PE 2850, Perc RAID 5 with 4 146GB HDs for data, and the meta-disk on
>> a separate partition on a separate RAID array.
>>
>> Primary Node:
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
>> ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
>> offset=0, inode=1431655765, rec_len=21845, name_len=85
>> Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
>> Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
>> ext3_journal_start_sb: Detected aborted journal
>> Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
>> start_transaction: Journal has aborted
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
>> ext3_create: IO failure
>>
>> Secondary Node (after the failure, when it becomes the primary):
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
>> ext3_clear_journal_err: Filesystem error recorded from previous mount:
>> IO failure
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
>> ext3_clear_journal_err: Marking fs in need of filesystem check.
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
>> errors, running e2fsck is recommended
>> Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
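
To spell out what Luciano is suggesting, a rough sketch of the in-place
fsck, run on whichever node is currently primary during a short maintenance
window. /data is just a placeholder mount point and this is an outline, not
a tested recipe for 0.7.23; the point is that e2fsck writes its repairs
through /dev/drbd0, so DRBD replicates them to the secondary and no separate
fsck or invalidate should be needed there:

    # on the current primary, with Heartbeat prevented from reacting
    # to the short outage
    umount /data              # filesystem must be unmounted for a safe check
    e2fsck -f /dev/drbd0      # repairs go through the DRBD device and
                              # replicate to the peer
    mount /dev/drbd0 /data    # remount and resume service
    cat /proc/drbd            # confirm both nodes are Connected/UpToDate

The downtime is only the e2fsck run itself, and there is no full resync
afterwards, which is what would make this cheaper than taking one node out
and invalidating the other.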
--
Charles Bennington
Oddcast, Inc.
direct: (646) 378-4327
main: (212) 375-6290 x327