Lars, none of your suggestions caused the drbd device to "unstick". Both
nodes were using anticipatory io-scheduling (changing to deadline didn't
get it going again, although I wanted to be running deadline, anyway, so
it's good to know that I wasn't).

Here are the relevant entries from /proc/drbd:

Secondary:
 3: cs:ServerForDLess st:Secondary/Primary ld:Consistent
    ns:772036 nr:18499220 dw:18499220 dr:772036 al:0 bm:465 lo:0 pe:0 ua:0 ap:0

Primary:
 3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
    ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0

Thanks,

Brent

On Tue, 25 Jul 2006, Brent A Nelson wrote:

> Some additional info: the mkfs is still hung and a subsequent attempt also
> hung. A short dd to the device did not hang, but it completed far too
> quickly and showed no activity on the secondary. A longer dd did hang.
>
> The machine has three stuck processes and top shows that the machine is in
> 100% wait.
>
> All 6 drbd devices have LVM logical volumes for their backing store (I used
> logical volumes so that the block devices wouldn't get reordered by the
> system if a disk disappeared; perhaps there's a better way). 3 disks are
> secondary for the other machine, and 3 disks are primary.
>
> Could this be an issue with drbd on LVM? Or maybe something that's fixed by a
> newer drbd version? A bug when compiled with gcc-3.4, maybe? Is there
> anything I should try to help diagnose the situation before I attempt to
> recover (these machines are not yet in production, so I can wait a bit, if
> needed)?
>
> Thanks,
>
> Brent
>
> On Mon, 24 Jul 2006, Brent A Nelson wrote:
>
>> I experienced a disk failure today when doing mkfs on one of 6 drbd
>> devices, which resulted in the process getting stuck in the "D" state.
>>
>> dmesg shows a series of SCSI errors and then the following on the primary:
>>
>> drbd3: drbd_md_sync_page_io(,390455306,WRITE) failed!
>> drbd3: Notified peer that my disk is broken.
>>
>> The secondary went to the "ServerForDLess" state and the primary went to
>> "DiskLessClient".
>>
>> This all seems like a normal drbd response, right? But, although I think I
>> can read from the device (read attempts don't report any errors, and the
>> secondary drbd processes seem to be busy serving data when I attempt a
>> read), I can't seem to write to it. I imagine if I switch the secondary
>> over to primary all will be well, but the primary should be able to pass
>> both reads and writes to the secondary in the event of its own disk
>> failing, correct?
>>
>> Is there something I'm doing wrong or a bug in my drbd (version 0.7.15 in
>> Ubuntu Dapper but running a 2.6.12 kernel)?
>>
>> Thanks,
>>
>> Brent Nelson
>> Director of Computing
>> Dept. of Physics
>> University of Florida
>>
>