I hacked the DRBD kernel module to put a call to dump_stack() at the point in drbd_endio_pri where the EIO error is returned to DRBD. I was able to trigger my problem on a test machine by simulating reads and then kicking off a RAID rebuild. This rebuild is triggered each week by the raid-check cron job that is part of the CentOS mdadm package.

# find /mnt/data -type f -exec cat "{}" > /dev/null \; &
# /etc/cron.weekly/99-raid-check &

/var/log/messages:
--
May 26 10:56:40 ragoon6 kernel: md: syncing RAID array md0
May 26 10:56:40 ragoon6 kernel: md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
May 26 10:56:40 ragoon6 kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reconstruction.
May 26 10:56:40 ragoon6 kernel: md: using 128k window, over a total of 1953514496 blocks.
May 26 10:58:36 ragoon6 kernel: block drbd0: p read: error=-5
May 26 10:58:36 ragoon6 kernel:
May 26 10:58:36 ragoon6 kernel: Call Trace:
May 26 10:58:36 ragoon6 kernel: [<ffffffff886398f2>] :drbd:drbd_endio_pri+0x66/0x129
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811b34a>] :dm_mod:dec_pending+0x134/0x18e
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811c15b>] :dm_mod:__split_bio+0x398/0x3b0
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811c94d>] :dm_mod:dm_request+0x115/0x124
May 26 10:58:36 ragoon6 kernel: [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel: [<ffffffff8001a893>] bio_alloc_bioset+0x89/0xd9
May 26 10:58:36 ragoon6 kernel: [<ffffffff886471c7>] :drbd:drbd_make_request_common+0xc00/0xc2b
May 26 10:58:36 ragoon6 kernel: [<ffffffff8002e31f>] __wake_up+0x38/0x4f
May 26 10:58:36 ragoon6 kernel: [<ffffffff886478b3>] :drbd:drbd_make_request_26+0x6c1/0x702
May 26 10:58:36 ragoon6 kernel: [<ffffffff800a0307>] autoremove_wake_function+0x0/0x2e
May 26 10:58:36 ragoon6 kernel: [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811b3f2>] :dm_mod:__map_bio+0x4e/0x125
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811bf39>] :dm_mod:__split_bio+0x176/0x3b0
May 26 10:58:36 ragoon6 kernel: [<ffffffff8811c94d>] :dm_mod:dm_request+0x115/0x124
May 26 10:58:36 ragoon6 kernel: [<ffffffff8001c040>] generic_make_request+0x211/0x228
May 26 10:58:36 ragoon6 kernel: [<ffffffff80023013>] mempool_alloc+0x31/0xe7
May 26 10:58:36 ragoon6 kernel: [<ffffffff80010ceb>] __find_get_block_slow+0xeb/0xf7
May 26 10:58:36 ragoon6 kernel: [<ffffffff80033488>] submit_bio+0xe4/0xeb
May 26 10:58:36 ragoon6 kernel: [<ffffffff8001a78a>] submit_bh+0xf1/0x111
May 26 10:58:36 ragoon6 kernel: [<ffffffff800173ac>] ll_rw_block+0x8c/0xab
May 26 10:58:36 ragoon6 kernel: [<ffffffff800e0f0c>] __breadahead+0x27/0x3b
May 26 10:58:36 ragoon6 kernel: [<ffffffff886ac0b4>] :ext4:__ext4_get_inode_loc+0x2e3/0x370
May 26 10:58:36 ragoon6 kernel: [<ffffffff886b0abc>] :ext4:ext4_iget+0x52/0x4db
May 26 10:58:36 ragoon6 kernel: [<ffffffff886b456c>] :ext4:ext4_lookup+0x82/0xc3
May 26 10:58:36 ragoon6 kernel: [<ffffffff80036e16>] __lookup_hash+0x10b/0x12f
May 26 10:58:36 ragoon6 kernel: [<ffffffff800e7140>] lookup_one_len+0x53/0x61
May 26 10:58:36 ragoon6 kernel: [<ffffffff885f9d0a>] :nfsd:compose_entry_fh+0xcd/0x121
May 26 10:58:36 ragoon6 kernel: [<ffffffff885f9f62>] :nfsd:encode_entry+0x204/0x53c
May 26 10:58:36 ragoon6 kernel: [<ffffffff80062ff8>] thread_return+0x62/0xfe
May 26 10:58:36 ragoon6 kernel: [<ffffffff8006e189>] do_gettimeofday+0x40/0x90
May 26 10:58:36 ragoon6 kernel: [<ffffffff8005aa51>] getnstimeofday+0x10/0x28
May 26 10:58:36 ragoon6 kernel: [<ffffffff800a22cc>] ktime_get_ts+0x1a/0x4e
May 26 10:58:36 ragoon6 kernel: [<ffffffff800bd3f3>] delayacct_end+0x5d/0x86
May 26 10:58:36 ragoon6 kernel: [<ffffffff80063a36>] __wait_on_bit+0x60/0x6e
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa2a5>] :nfsd:nfs3svc_encode_entry_plus+0xb/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff886a90a0>] :ext4:call_filldir+0x7f/0x99
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff886a9363>] :ext4:ext4_readdir+0x1bd/0x536
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff80022ef2>] file_move+0x36/0x44
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff80035292>] vfs_readdir+0x77/0xa9
May 26 10:58:36 ragoon6 kernel: [<ffffffff885fa29a>] :nfsd:nfs3svc_encode_entry_plus+0x0/0x10
May 26 10:58:36 ragoon6 kernel: [<ffffffff885f1ea0>] :nfsd:nfsd_readdir+0x6d/0xc5
May 26 10:58:36 ragoon6 kernel: [<ffffffff885f9122>] :nfsd:nfsd3_proc_readdirplus+0xf8/0x220
May 26 10:58:36 ragoon6 kernel: [<ffffffff885ee1db>] :nfsd:nfsd_dispatch+0xd8/0x1d6
May 26 10:58:36 ragoon6 kernel: [<ffffffff8857f529>] :sunrpc:svc_process+0x454/0x71b
May 26 10:58:36 ragoon6 kernel: [<ffffffff80064644>] __down_read+0x12/0x92
May 26 10:58:36 ragoon6 kernel: [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel: [<ffffffff885ee746>] :nfsd:nfsd+0x1a5/0x2cb
May 26 10:58:36 ragoon6 kernel: [<ffffffff8005dfb1>] child_rip+0xa/0x11
May 26 10:58:36 ragoon6 kernel: [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel: [<ffffffff885ee5a1>] :nfsd:nfsd+0x0/0x2cb
May 26 10:58:36 ragoon6 kernel: [<ffffffff8005dfa7>] child_rip+0x0/0x11
May 26 10:58:36 ragoon6 kernel:
May 26 10:58:36 ragoon6 kernel: block drbd0: Local READ failed sec=92278400s size=4096
May 26 10:58:36 ragoon6 kernel: block drbd0: disk( UpToDate -> Failed )
May 26 10:58:36 ragoon6 kernel: block drbd0: Local IO failed in __req_mod.Detaching...
May 26 10:58:36 ragoon6 kernel: block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
May 26 10:58:36 ragoon6 kernel: block drbd0: Sorry, I have no access to good data anymore.
--

Should the rebuild be safe? I would assume so, since this is the default on CentOS: the array goes through the check procedure each Sunday in the early AM. Why would this cause me issues with DRBD?
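
For anyone who wants to try the same reproduction without waiting for the weekly cron job, here is a minimal sketch of the two steps as a script. It rests on assumptions: the device name md0 and the path /mnt/data are taken from the commands and log above, the write of "check" to the md sync_action file in sysfs is my understanding of what the raid-check script does per array and may differ on your release, and DRY_RUN is a hypothetical switch added here so that the commands are only printed by default.

```shell
#!/bin/sh
# Sketch only: read load plus an md check, as described in the post.
# MD_DEV and DATA_DIR come from the post; DRY_RUN is a hypothetical
# safety switch (1 = print the commands, 0 = really run them as root).
MD_DEV="${MD_DEV:-md0}"
DATA_DIR="${DATA_DIR:-/mnt/data}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print the command line in dry-run mode, otherwise hand it to sh.
    if [ "$DRY_RUN" = "1" ]; then
        printf 'would run: %s\n' "$*"
    else
        sh -c "$*"
    fi
}

start_reads() {
    # Background read load over every file under DATA_DIR.
    run "find $DATA_DIR -type f -exec cat {} \; > /dev/null &"
}

start_check() {
    # Assumption: writing "check" to sync_action starts the same kind
    # of pass that the weekly raid-check job requests.
    run "echo check > /sys/block/$MD_DEV/md/sync_action"
}

start_reads
start_check
```

Run it as is to see the two commands, then with DRY_RUN=0 as root on a test machine only, since the point is to provoke the read error shown in the log.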