[DRBD-user] DRBD and EIO, stack trace.

Ben Timby btimby at gmail.com
Wed May 26 17:09:14 CEST 2010


I hacked the DRBD kernel module to add a call to dump_stack() at the
point where the EIO error is returned to DRBD in drbd_endio_pri. I was
able to reproduce my problem on a test machine by generating read load
and then kicking off a RAID rebuild. The same rebuild is triggered each
week by the raid-check cron job that is part of the CentOS mdadm
package.

# find /mnt/data -type f -exec cat "{}" > /dev/null \; &
# /etc/cron.weekly/99-raid-check &
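
For anyone who wants to reproduce this without the cron script: the md
driver exposes a sync_action file per array in sysfs, and writing
"check" to it starts a consistency pass by hand. Presumably the
raid-check script does much the same thing. The following is only a
minimal sketch under that assumption. The function name start_md_check
and the SYSFS_ROOT override are made up for illustration and are not
part of the CentOS script.

```shell
#!/bin/sh
# Hypothetical helper: start an md consistency check on one array.
# SYSFS_ROOT is an illustrative override so the function can be
# dry-run against a scratch directory instead of the real /sys.
start_md_check() {
    array="$1"
    action_file="${SYSFS_ROOT:-/sys}/block/${array}/md/sync_action"

    # The file only exists for md arrays and is only writable by root.
    if [ ! -w "$action_file" ]; then
        echo "cannot write $action_file" >&2
        return 1
    fi

    # Refuse to start a check while another sync action is running.
    current=$(cat "$action_file")
    if [ "$current" != "idle" ]; then
        echo "$array is busy: $current" >&2
        return 1
    fi

    echo check > "$action_file"
}
```

Run as root, "start_md_check md0" followed by "cat /proc/mdstat" should
show the check in progress.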

/var/log/messages:
--
May 26 10:56:40 ragoon6 kernel: md: syncing RAID array md0
May 26 10:56:40 ragoon6 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
May 26 10:56:40 ragoon6 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
May 26 10:56:40 ragoon6 kernel: md: using 128k window, over a total of 1953514496 blocks.
May 26 10:58:36 ragoon6 kernel: block drbd0: p read: error=-5
May 26 10:58:36 ragoon6 kernel:
May 26 10:58:36 ragoon6 kernel: Call Trace:
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886398f2>] :drbd:drbd_endio_pri+0x66/0x129
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811b34a>] :dm_mod:dec_pending+0x134/0x18e
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811c15b>] :dm_mod:__split_bio+0x398/0x3b0
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811c94d>] :dm_mod:dm_request+0x115/0x124
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8001a893>] bio_alloc_bioset+0x89/0xd9
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886471c7>] :drbd:drbd_make_request_common+0xc00/0xc2b
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8002e31f>] __wake_up+0x38/0x4f
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886478b3>] :drbd:drbd_make_request_26+0x6c1/0x702
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800a0307>] autoremove_wake_function+0x0/0x2e
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811b3f2>] :dm_mod:__map_bio+0x4e/0x125
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811bf39>] :dm_mod:__split_bio+0x176/0x3b0
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8811c94d>] :dm_mod:dm_request+0x115/0x124
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80023013>] mempool_alloc+0x31/0xe7
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80010ceb>] __find_get_block_slow+0xeb/0xf7
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80033488>] submit_bio+0xe4/0xeb
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8001a78a>] submit_bh+0xf1/0x111
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800173ac>] ll_rw_block+0x8c/0xab
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800e0f0c>] __breadahead+0x27/0x3b
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886ac0b4>] :ext4:__ext4_get_inode_loc+0x2e3/0x370
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886b0abc>] :ext4:ext4_iget+0x52/0x4db
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886b456c>] :ext4:ext4_lookup+0x82/0xc3
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80036e16>] __lookup_hash+0x10b/0x12f
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800e7140>] lookup_one_len+0x53/0x61
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885f9d0a>] :nfsd:compose_entry_fh+0xcd/0x121
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885f9f62>] :nfsd:encode_entry+0x204/0x53c
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80062ff8>] thread_return+0x62/0xfe
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8006e189>] do_gettimeofday+0x40/0x90
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8005aa51>] getnstimeofday+0x10/0x28
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800a22cc>] ktime_get_ts+0x1a/0x4e
May 26 10:58:36 ragoon6 kernel:  [<ffffffff800bd3f3>] delayacct_end+0x5d/0x86
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa2a5>] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886a90a0>] :ext4:call_filldir+0x7f/0x99
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff886a9363>] :ext4:ext4_readdir+0x1bd/0x536
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80022ef2>] file_move+0x36/0x44
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80035292>] vfs_readdir+0x77/0xa9
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885f1ea0>] :nfsd:nfsd_readdir+0x6d/0xc5
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885f9122>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x220
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885ee1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8857f529>] :sunrpc:svc_process+0x454/0x71b
May 26 10:58:36 ragoon6 kernel:  [<ffffffff80064644>] __down_read+0x12/0x92
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885ee746>] :nfsd:nfsd+0x1a5/0x2cb
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel:  [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 26 10:58:36 ragoon6 kernel:
May 26 10:58:36 ragoon6 kernel: block drbd0: Local READ failed sec=92278400s size=4096
May 26 10:58:36 ragoon6 kernel: block drbd0: disk( UpToDate -> Failed )
May 26 10:58:36 ragoon6 kernel: block drbd0: Local IO failed in __req_mod.Detaching...
May 26 10:58:36 ragoon6 kernel: block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
May 26 10:58:36 ragoon6 kernel: block drbd0: Sorry, I have no access to good data anymore.
--
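
To line the DRBD failure up against the start of the md sync without
wading through the whole trace, a filter along these lines might help.
It is only a sketch: the function name md_drbd_timeline is made up, and
the patterns are taken from the messages quoted above, so other kernel
or DRBD versions may word them differently.

```shell
#!/bin/sh
# Hypothetical helper: print only the md sync start and the DRBD local
# read error / disk state lines from a syslog file.
md_drbd_timeline() {
    logfile="${1:-/var/log/messages}"
    # Patterns copied from the log excerpt in this post.
    grep -E 'kernel: (md: syncing RAID array|block drbd[0-9]+: (p read: error|Local READ failed|disk\())' "$logfile"
}
```

Calling "md_drbd_timeline /var/log/messages" prints the matching lines
in log order, so the gap between the sync start and the first error is
easy to read off the timestamps.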

Should the rebuild be safe? I would assume so, since this is the
default on CentOS: the array goes through the check procedure early
every Sunday morning. Why would this cause issues with DRBD?


