Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Just running fsck on /dev/drbd0 will require bringing the cluster offline
so that I can umount the filesystem -- which isn't really an option (a rough
sketch of what that would involve is below, after the quoted logs). So it
sounds like I really am stuck with taking one node offline to perform the
fsck, switching over to it so that it becomes master, and then re-syncing
the other node?

Luciano Miguel Ferreira Rocha wrote:
> These don't indicate I/O errors, only data integrity errors. In this
> case, you should be able to fsck /dev/drbd0 and have the fix replicate
> to the slave.
>
> On Thu, Apr 26, 2007 at 10:12:09AM -0400, Charles Bennington wrote:
>
>> I am struggling to come up with a solution to a problem and would
>> appreciate any advice. This is not strictly a DRBD issue, but the nature
>> of drbd's block-replicated filesystem leaves me in a sticky situation.
>> The problem is that after suffering an IO error of some sort on one disk,
>> the error is replicated to the secondary node. Are there any
>> recommended or standard operating procedures people use to deal with
>> this situation? The only thing I can think to do is to take the
>> original primary node where the error occurred offline, run an fsck on
>> it, and then bring it back online as the master and invalidate the
>> secondary so that it fully replicates the fsck-ed drive on the primary.
>> Not only does this sound a little scary from a logistical point of view,
>> it also sounds like I will lose data or be forced to run the cluster
>> read-only during the time it takes to fsck the primary.
>>
>> Unfortunately, this is a production server involved in intense
>> read/write activity around the clock. I am hoping that perhaps I have
>> missed some simpler solution to my problem, and so I am reaching out to
>> the drbd community.
>>
>> Here is some information about the cluster:
>>
>> DRBD 0.7.23 (with Heartbeat 2.0.7)
>> CentOS 4.4
>> Kernel 2.6.15.7
>> Dell PE 2850, Perc RAID 5 with 4 146GB HDs for data, and the meta-disk on
>> a separate partition on a separate RAID array.
>>
>> Primary Node:
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
>> ext3_add_entry: bad entry in directory #84378594: rec_len % 4 != 0 -
>> offset=0, inode=1431655765, rec_len=21845, name_len=85
>> Apr 21 12:28:08 dc1con107 kernel: Aborting journal on device drbd0.
>> Apr 21 12:28:08 dc1con107 kernel: ext3_abort called.
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0):
>> ext3_journal_start_sb: Detected aborted journal
>> Apr 21 12:28:08 dc1con107 kernel: Remounting filesystem read-only
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
>> start_transaction: Journal has aborted
>> Apr 21 12:28:08 dc1con107 kernel: EXT3-fs error (device drbd0) in
>> ext3_create: IO failure
>>
>> Secondary Node (after the failure, when it becomes the primary):
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
>> ext3_clear_journal_err: Filesystem error recorded from previous mount:
>> IO failure
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning (device drbd0):
>> ext3_clear_journal_err: Marking fs in need of filesystem check.
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs warning: mounting fs with
>> errors, running e2fsck is recommended
>> Apr 21 13:06:17 dc1con108 kernel: EXT3 FS on drbd0, internal journal
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: recovery complete.
>> Apr 21 13:06:17 dc1con108 kernel: EXT3-fs: mounted filesystem with
>> ordered data mode.
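
To spell out what Luciano is suggesting, a rough sketch of the in-place
fsck, run on whichever node is currently primary during a short maintenance
window. /data is just a placeholder mount point and this is an outline, not
a tested recipe for 0.7.23; the point is that e2fsck writes its repairs
through /dev/drbd0, so DRBD replicates them to the secondary and no separate
fsck or invalidate should be needed there:

    # on the current primary, with Heartbeat prevented from reacting
    # to the short outage
    umount /data              # filesystem must be unmounted for a safe check
    e2fsck -f /dev/drbd0      # repairs go through the DRBD device and
                              # replicate to the peer
    mount /dev/drbd0 /data    # remount and resume service
    cat /proc/drbd            # confirm both nodes are Connected/UpToDate

The downtime is only the e2fsck run itself, and there is no full resync
afterwards, which is what would make this cheaper than taking one node out
and invalidating the other.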
--
Charles Bennington
Oddcast, Inc.
direct: (646) 378-4327
main: (212) 375-6290 x327