[DRBD-user] drbd-8.2.7 Local READ failed... ds: Diskless/UpToDate

Tue Dec 23 15:57:15 CET 2008

On Tue, Dec 23, 2008 at 09:31:31AM -0500, Roger Tsang wrote:
> Hi,
> 
> I'm using drbd-8.2.7 with "common { disk { on-io-error detach; } }" and see
> "drbd3: Local READ failed..." messages even after logs show drbd3 disk state
> changed to Diskless.  It seems drbd did not detach the local drbd3 disk.
> 
> It is causing load average to increase beyond 40 and the file system stacked on
> drbd3 to stall waiting for I/O (unacceptable).
> 
> If this is not a bug, can there be an option to emulate drbd-0.7 behavior
> and detach the local disk immediately on I/O error?

nothing to "emulate" there, as drbd 8 _does_ detach immediately.
until proven otherwise I'd say these messages are coming from requests already
submitted (before drbd "detached"), but not yet completed.
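
One way to collect evidence for or against that: list every local I/O
failure that was logged after the device went Diskless. A minimal sketch in
Python, assuming raw syslog lines (without the "> " quote prefix) in the
format shown in this thread. The function name failures_after_detach is made
up for illustration, and the result proves nothing by itself: it only shows
how many failures were logged after the detach, and when.

```python
import re

# Assumed line formats, taken from the syslog excerpts quoted in this thread:
#   Dec 22 10:25:38 node1 kernel: drbd3: Local READ failed sec=1952848s size=4096
#   Dec 22 10:25:38 node1 kernel: drbd3: disk( Failed -> Diskless )
FAILED = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) \S+ kernel: (drbd\d+): "
    r"Local (READ|WRITE) failed sec=(\d+)s"
)
DISKLESS = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) \S+ kernel: (drbd\d+): disk\( \w+ -> Diskless \)"
)


def failures_after_detach(lines, device):
    """Return (timestamp, op, sector) for each local I/O failure logged
    after `device` first went Diskless. Log order is used, not clock time."""
    detached = False
    late = []
    for line in lines:
        m = DISKLESS.match(line)
        if m and m.group(2) == device:
            detached = True
            continue
        m = FAILED.match(line)
        if m and m.group(2) == device and detached:
            late.append((m.group(1), m.group(3), int(m.group(4))))
    return late


# A few lines copied from the quoted log, for demonstration only.
sample = [
    "Dec 22 10:25:38 node1 kernel: drbd3: Local READ failed sec=1952848s size=4096",
    "Dec 22 10:25:38 node1 kernel: drbd3: disk( UpToDate -> Failed )",
    "Dec 22 10:25:38 node1 kernel: drbd3: disk( Failed -> Diskless )",
    "Dec 22 10:33:08 node1 kernel: drbd3: Local READ failed sec=233206472s size=4096",
    "Dec 22 11:14:10 node1 kernel: drbd3: Local WRITE failed sec=178590593s size=512",
]

late = failures_after_detach(sample, "drbd3")
for when, op, sector in late:
    print(when, op, sector)
```

If the list keeps growing long after the detach, that is worth reporting
together with the full log.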

> Dec 22 10:25:38 node1 kernel: ata2: command 0x25 timeout, stat 0xd0 host_stat 0x1
> Dec 22 10:25:38 node1 kernel: ata2: status=0xd0 { Busy }
> Dec 22 10:25:38 node1 kernel: SCSI error : <1 0 0 0> return code = 0x8000002
> Dec 22 10:25:38 node1 kernel: sdb: Current: sense key: Aborted Command
> Dec 22 10:25:38 node1 kernel:     Additional sense: Scsi parity error
> Dec 22 10:25:38 node1 kernel: end_request: I/O error, dev sdb, sector 4057363
> Dec 22 10:25:38 node1 kernel: drbd3: got an _req_mod() errno of -5
> Dec 22 10:25:38 node1 kernel: drbd3: Local READ failed sec=1952848s size=4096
> Dec 22 10:25:38 node1 kernel: drbd3: disk( UpToDate -> Failed )
> Dec 22 10:25:38 node1 kernel: drbd3: Local IO failed. Detaching...
> Dec 22 10:25:38 node1 kernel: ATA: abnormal status 0xD0 on port 0xE007
> Dec 22 10:25:38 node1 last message repeated 2 times
> Dec 22 10:25:38 node1 kernel: drbd3: disk( Failed -> Diskless )
> Dec 22 10:25:38 node1 kernel: drbd3: Notified peer that my disk is broken.
> ...

what does /proc/drbd look like now?
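
For anyone who wants to check this from a script: a rough sketch that pulls
the connection state and the local/peer disk states out of /proc/drbd. It
assumes the drbd 8.x layout, where each device line carries "cs:" and
"ds:<local>/<peer>" fields. The sample line below is illustrative, not the
poster's actual output, and the helper names are made up.

```python
import re

# Assumed device-line layout (drbd 8.x), e.g.:
#   " 3: cs:Connected st:Primary/Secondary ds:Diskless/UpToDate C r---"
DEVICE_LINE = re.compile(r"^\s*(\d+):\s+cs:(\S+).*?\bds:(\w+)/(\w+)")


def disk_states(text):
    """Map minor number -> connection state and local/peer disk state."""
    states = {}
    for line in text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            states[int(m.group(1))] = {
                "cs": m.group(2),
                "local": m.group(3),
                "peer": m.group(4),
            }
    return states


def read_proc_drbd(path="/proc/drbd"):
    """Read and parse the live status file (needs the drbd module loaded)."""
    with open(path) as f:
        return disk_states(f.read())


# Illustrative sample only; not taken from the affected node.
sample = " 3: cs:Connected st:Primary/Secondary ds:Diskless/UpToDate C r---\n"
states = disk_states(sample)
print(states[3]["local"], states[3]["peer"])
```

A local state of Diskless with the peer UpToDate would mean the detach did
happen and reads should be served by the peer.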

> Dec 22 10:33:07 node1 watchdog[68054]: loadavg 37 24 12 is higher than the given threshold 36 27 18!
> Dec 22 10:33:07 node1 watchdog[68054]: shutting down the system because of error -3
> Dec 22 10:33:08 node1 kernel: ata2: command 0x25 timeout, stat 0xd0 host_stat 0x1
> Dec 22 10:33:08 node1 kernel: ata2: status=0xd0 { Busy }
> Dec 22 10:33:08 node1 kernel: SCSI error : <1 0 0 0> return code = 0x8000002
> Dec 22 10:33:08 node1 kernel: sdb: Current: sense key: Aborted Command
> Dec 22 10:33:08 node1 kernel:     Additional sense: Scsi parity error
> Dec 22 10:33:08 node1 kernel: end_request: I/O error, dev sdb, sector 235310987
> Dec 22 10:33:08 node1 kernel: drbd3: got an _req_mod() errno of -5
> Dec 22 10:33:08 node1 kernel: drbd3: Local READ failed sec=233206472s size=4096
> Dec 22 10:33:08 node1 kernel: ATA: abnormal status 0xD0 on port 0xE007
> Dec 22 10:33:08 node1 last message repeated 2 times
> ...
> Shutdown/reboot with sync took _very_ long; gets stuck waiting for drbd3!
> ...

did it finish, or did you need to hard-reset?

> Dec 22 11:13:10 node1 kernel: end_request: I/O error, dev sdb, sector 449904539
> Dec 22 11:13:10 node1 kernel: drbd3: got an _req_mod() errno of -5
> Dec 22 11:13:10 node1 kernel: drbd3: Local READ failed sec=447800024s size=4096
> ...
> Dec 22 11:14:10 node1 kernel: end_request: I/O error, dev sdb, sector 180695108
> Dec 22 11:14:10 node1 kernel: drbd3: got an _req_mod() errno of -5
> Dec 22 11:14:10 node1 kernel: drbd3: Local WRITE failed sec=178590593s size=512
> ...
> Dec 22 11:20:37 node1 syslogd 1.4.1: restart (remote reception).
> Dec 22 11:20:37 node1 syslog: syslogd startup succeeded

what exactly is your sdb, and what happened to it?

> ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> Life on your PC is safer, easier, and more enjoyable with Windows Vista . See
> how

now, is that so. really.  ;)

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed