[DRBD-user] DRBD error -5 - I/O Error

Sat Jan 5 02:51:28 CET 2008

On Fri, Jan 04, 2008 at 06:07:52PM -0300, Ítalo Rossi wrote:
> Hello all,
> 
> I have two servers with drbd0.7.24 with LVM and XFS filesystem.
> 
> My Storages:
> 
> 
> Server 1( Red Hat 4)                           		Server 2( Debian Etch)
>         | 								 |
> 	|								 |
>    DRBD   -   -   -   -  -   - > -   -  - -   -   -   -   -   -   -   DRBD
>         |								 |
>         |								 |
>      LVM2 (3.2T) 					   		   
>      RAID0 (3.2T)   --> MD1000 (1 HBA)
>          |
>          |
>          |
>       /    \
>     /        \				__
>   /       RAID0(1.6T)           |
>  |					    | -> MD3000(2 HBAs)
> RAID0(1.6T)		       __|
> 
> So, if the Server 1 become primary  I have this messages on dmesg
> (on-io-error = "pass_on"):
> 
> mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code=(0x17), SubCode(0x0000)
> mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code=(0x17), SubCode(0x0000)
> mptbase: ioc0: LogInfo(0x31170000): Originator={PL}, Code=(0x17), SubCode(0x0000)
> SCSI error : <0 0 21 1> return code = 0x20008
> end_request: I/O error, dev sdb, sector 2942884192
> drbd1: Ignoring local IO error!
...

this is a real scsi error,
i.e. the MD3000 hardware reports an IO error to the kernel.

google for
 mptbase SCSI error end_request I/O error
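
for reference, the "return code = 0x20008" in the log above can be split
into its byte fields. a minimal sketch, assuming the usual Linux 2.6 layout
of the SCSI result word (status, message, host and driver byte, from low to
high) and listing only a few of the code names from the kernel's
include/scsi/scsi.h; the function and table names are made up for this
example:

```python
# Assumed layout of a Linux 2.6 SCSI result word (low byte to high byte):
#   status byte | message byte | host byte | driver byte
# Only a handful of codes are listed; the tables are not complete.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x07: "DID_ERROR",
}

STATUS_BYTE_NAMES = {
    0x00: "GOOD",
    0x02: "CHECK CONDITION",
    0x08: "BUSY",
}


def decode_scsi_result(result):
    """Split a SCSI result word into its four byte fields."""
    return {
        "status": result & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "driver": (result >> 24) & 0xFF,
    }


def describe(result):
    """Return a short description of the host and status bytes."""
    fields = decode_scsi_result(result)
    host = HOST_BYTE_NAMES.get(fields["host"], "unknown")
    status = STATUS_BYTE_NAMES.get(fields["status"], "unknown")
    return "host=%s status=%s" % (host, status)


if __name__ == "__main__":
    # Value taken from the dmesg output quoted above.
    print(describe(0x20008))
```

under that assumption the host byte is 0x02 and the status byte is 0x08,
i.e. the error is reported from the HBA/storage side and not from DRBD.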

> I need to umount the drbd, run xfs_repair and remount, this solves my  
> problem but in 1 week it breaks again..
> 
> So, trying to solve this issue, I set the Server 2 in primary state and  
> Server 1 secondary and I'm still getting these errors on Server 1  
> dmesg, but my application is still running without problems:

...

> Is this a DRBD, LVM2 or MD3000 (modules or cable) issue?

5 minutes of googling suggests that this may have been worked around
in the kernel driver of more recent kernels (kernel.org 2.6.19 and
later; potentially backported into some, but not all, "stable"
four-digit kernel releases; very difficult to tell for vendor kernel
versions within 5 minutes), by throttling the transmission rate or
something.  apparently there is also some BIOS setting in the MD3000 to
reduce the transfer rate there, to avoid the problem with older kernels.

redhat bugzilla entries for similar-looking issues on redhat 3
and 4 and fedora core (various versions) exist
(add bugzilla.redhat.com to the above google keywords).

I suggest you contact vendor support for recommendations.

one DRBD related note, still:

I'm not exactly sure off the top of my head
how "on-io-error pass_on" behaves in 0.7.
we do not use this setting very often.

but in any case the corresponding sectors
will be out-of-sync now. you need to at least
disconnect/reconnect the drbd pair to have them
resynced; otherwise, on the next failover,
surprise: some sectors (those where drbd
was not able to write the data, but ignored
the io error because you configured it to do so)
will contain "unexpected data".
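
the disconnect/reconnect cycle could be scripted roughly like this. a
minimal sketch only: it assumes drbdadm is in the PATH and that the
resource is called "r0", which is a placeholder (use the name from your
drbd.conf). the helper names are made up, and by default it only prints
the commands instead of running them:

```python
import subprocess


def reconnect_commands(resource):
    """Return the drbdadm calls for one disconnect/reconnect cycle."""
    return [
        ["drbdadm", "disconnect", resource],
        ["drbdadm", "connect", resource],
    ]


def reconnect(resource, dry_run=True):
    """Print the commands, or run them in order when dry_run is False."""
    for cmd in reconnect_commands(resource):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first command that fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "r0" is a hypothetical resource name, not taken from the original post.
    reconnect("r0")
```

afterwards, watch /proc/drbd on both nodes to see whether a resync
actually takes place.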

-- 
: Lars Ellenberg                            Tel +43-1-8178292-55 :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :