[DRBD-user] recovering from "Local IO failed. Detaching..."

Lars Ellenberg lars.ellenberg at linbit.com
Thu Sep 10 18:47:24 CEST 2009

On Thu, Sep 10, 2009 at 06:28:23PM +0200, Gianluca Cecchi wrote:
> On Thu, Sep 10, 2009 at 6:20 PM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> 
> > [snip]
> >
> > I've seen similar symptoms before, and they could be worked around by
> > disabling the offload settings on the NICs used for replication ;)
> > I know that interaction sounds a bit far-fetched, but those are the
> > facts.
> >
> > # to view offload settings
> > ethtool -k eth7
> > # to switch off rx/tx checksumming, scatter-gather and TSO:
> > ethtool -K eth7 rx off tx off sg off tso off
> >
> [root@virtfedbis ~]# ethtool -k eth3
> Offload parameters for eth3:
> Cannot get device flags: Operation not supported
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
> 
> [root@virtfedbis ~]# ethtool -K eth3 rx off tx off sg off tso off
> 
> [root@virtfedbis ~]# ethtool -k eth3
> Offload parameters for eth3:
> Cannot get device flags: Operation not supported
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp-segmentation-offload: off
> udp-fragmentation-offload: off
> generic-segmentation-offload: on
> generic-receive-offload: off
> large-receive-offload: off
> 
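To compare the two listings above, a small filter can print only the features that ethtool -k still reports as on. This is a minimal sketch that assumes the "feature: value" line format shown in this thread; the function name offloads_on is hypothetical. Fed the second listing, it leaves only generic-segmentation-offload, which the "ethtool -K eth3 rx off tx off sg off tso off" command does not name.

```shell
#!/bin/sh
# Print the names of offload features reported as "on" by `ethtool -k`.
# Reads the ethtool output on stdin, so saved output can be checked too.
offloads_on() {
    # Split "feature: value" lines; the header line has no ": " and the
    # "Cannot get device flags: ..." line has a value other than "on".
    # Also accept a trailing annotation such as " [fixed]" after "on",
    # which some ethtool versions may print.
    awk -F': ' 'NF == 2 && ($2 == "on" || $2 ~ /^on /) { print $1 }'
}

# Usage on a live system (interface name as in this thread):
#   ethtool -k eth3 | offloads_on
```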
> If I try the attach without applying the same settings to eth3 on the
> other peer, I get:
> 
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Found 6 transactions (244 active extents) in activity log.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Method to ensure write ordering: barrier
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: max_segment_size ( = BIO size ) = 32768
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: recounting of set bits took additional 1 jiffies
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked out-of-sync by on disk bit-map.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Marked additional 0 KB as out-of-sync based on AL.
> Sep 10 18:24:34 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0, sector 0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: meta data flush failed with status -95, disabling md-flushes
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Attaching -> Negotiating )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: drbd_sync_handshake:
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: self D5C42445B9F5C227:0000000000000000:0DB564243F5AA9A3:377245292BBD1112 bits:235520 flags:0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: peer A0332E51B243BEE1:D5C42445B9F5C227:FFEDAA5E725D8157:13925DF660B57F5D bits:309189 flags:0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Becoming sync target due to disk states.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT ) disk( Negotiating -> Outdated )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Began resync as SyncTarget (will sync 1236756 KB [309189 bits set]).
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Resync aborted.
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Local IO failed. Detaching...
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
> Sep 10 18:24:34 virtfedbis kernel: block drbd0: Notified peer that my disk is broken.
> 
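The sizes in these log lines are consistent with DRBD's bitmap granularity of one bit per 4 KiB of storage (stated here from general DRBD 8.x knowledge, not from this thread): 235520 bits correspond to the 920 MB marked out-of-sync, and the 309189 bits set for the resync correspond to the 1236756 KB it announces. A quick shell check, with hypothetical helper names:

```shell
#!/bin/sh
# Convert DRBD bitmap bit counts to sizes, assuming one bit per 4 KiB block.
bits_to_kib() {
    echo $(( $1 * 4 ))
}
bits_to_mib() {
    echo $(( $1 * 4 / 1024 ))   # integer MiB, rounded down
}

bits_to_mib 235520   # 920     ("920 MB (235520 bits)")
bits_to_kib 309189   # 1236756 ("will sync 1236756 KB [309189 bits set]", first log)
bits_to_kib 310129   # 1240516 (same line in the second log)
```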
> Even after applying the same settings on the other peer, I get:
> 
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Diskless -> Attaching )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Found 6 transactions (244 active extents) in activity log.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Method to ensure write ordering: barrier
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: max_segment_size ( = BIO size ) = 32768
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: recounting of set bits took additional 1 jiffies
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: 920 MB (235520 bits) marked out-of-sync by on disk bit-map.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Marked additional 0 KB as out-of-sync based on AL.
> Sep 10 18:26:06 virtfedbis kernel: end_request: I/O error, dev cciss/c0d0, sector 0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: meta data flush failed with status -95, disabling md-flushes
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: disk( Attaching -> Negotiating )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: drbd_sync_handshake:
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: self FAFACA8496A4ED9D:0000000000000000:0DB564243F5AA9A3:377245292BBD1112 bits:235520 flags:0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: peer A0332E51B243BEE1:FAFACA8496A4ED9D:D5C42445B9F5C227:FFEDAA5E725D8157 bits:310129 flags:0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: uuid_compare()=-1 by rule 50
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Becoming sync target due to disk states.
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( Connected -> WFBitMapT ) disk( Negotiating -> Outdated )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: helper command: /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: conn( WFSyncUUID -> SyncTarget ) disk( Outdated -> Inconsistent )
> Sep 10 18:26:06 virtfedbis kernel: block drbd0: Began resync as SyncTarget (will sync 1240516 KB [310129 bits set]).
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Resync aborted.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> Failed )
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Local IO failed. Detaching...
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: 1121 messages suppressed in /root/drbd-8.3.3rc1/dist/BUILD/drbd-8.3.3rc1/drbd/drbd_receiver.c:1573.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to local disk.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: drbd_rs_complete_io() called, but extent not found
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: disk( Failed -> Diskless )
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Notified peer that my disk is broken.
> Sep 10 18:26:07 virtfedbis kernel: block drbd0: Can not write resync data to local disk.
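Both handshakes above report "uuid_compare()=-1 by rule 50". As I understand the DRBD 8.3 handshake, that rule matches when the peer's bitmap UUID (second field of the "peer" line) equals the local current UUID (first field of the "self" line), ignoring the lowest bit, which DRBD uses as a flag bit; -1 means the local node is the one that has to receive data (sync target). The sketch below models only that single rule (the kernel checks other rules as well), with hypothetical function names, and yields -1 for the UUIDs in both logs.

```shell
#!/bin/sh
# Simplified sketch of one DRBD handshake rule ("rule 50") as seen in the
# logs above; all other rules are deliberately left out.

# Print a 64-bit hex UUID with its lowest bit cleared (flag bit ignored).
strip_flag() {
    u=$(printf '%s' "$1" | tr 'a-f' 'A-F')
    prefix=${u%?}            # all but the last hex digit
    last=${u#"$prefix"}      # the last hex digit
    printf '%s%X\n' "$prefix" $(( 0x$last & 14 ))
}

# $1: the "self" UUID list, $2: the "peer" UUID list, both in the
# current:bitmap:history1:history2 form printed by drbd_sync_handshake.
rule50() {
    self_current=${1%%:*}         # first field of the self list
    peer_rest=${2#*:}             # drop the peer's current UUID
    peer_bitmap=${peer_rest%%:*}  # the peer's bitmap UUID
    if [ "$(strip_flag "$self_current")" = "$(strip_flag "$peer_bitmap")" ]; then
        printf '%s\n' -1          # local node should become sync target
    else
        printf '%s\n' "rule 50 does not apply"
    fi
}

# Values from the first log above:
rule50 D5C42445B9F5C227:0000000000000000:0DB564243F5AA9A3:377245292BBD1112 \
       A0332E51B243BEE1:D5C42445B9F5C227:FFEDAA5E725D8157:13925DF660B57F5D
```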


too bad.
then something is wrong with your hardware, or your setup.
or your kernel.
or, of course, maybe it is only drbd that is wrong (in your setup, on
your hardware ;-])

care to try
        no-disk-flushes;
        no-md-flushes;
        no-disk-barrier;
?
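For reference, in DRBD 8.3 these three options go in the disk section of the resource definition. Note that the logs above already show DRBD disabling md-flushes on its own after the failed meta-data flush (status -95). A minimal sketch, assuming a resource named r0 (hypothetical) defined in /etc/drbd.conf; keep the file identical on both nodes, then apply it to the running resource:

```shell
# Hypothetical resource name "r0". In /etc/drbd.conf, the resource's
# disk section would contain:
#
#   resource r0 {
#     disk {
#       no-disk-flushes;
#       no-md-flushes;
#       no-disk-barrier;
#     }
#     # ... existing sections unchanged ...
#   }
#
# Then, on each node, apply the changed configuration:
drbdadm adjust r0
```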
if that does not help, what about:
8.3.2?
8.3.3rc2?
various other drbd versions? kernels?
a different lower-level device? (not cciss? another cciss drive/partition?)
etc.
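When working through such a matrix, it helps to record exactly which DRBD module and kernel each attempt ran on. A minimal sketch, assuming the version line of /proc/drbd starts with "version: " as DRBD 8.3 prints it; the helper name drbd_version is hypothetical:

```shell
#!/bin/sh
# Print the DRBD module version from /proc/drbd-style input on stdin,
# i.e. the first word after "version: " on the version line.
drbd_version() {
    sed -n 's/^version: \([^ ]*\).*/\1/p'
}

# Usage on a node with the module loaded:
#   echo "drbd $(drbd_version < /proc/drbd), kernel $(uname -r)"
```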

if all else fails: contact linbit, we do sell support.

we even sell "drbd health checks", which basically boil down to a
one-time engagement, though for those you may need to wait for a
time slot that suits linbit.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


