Lars, none of your suggestions caused the drbd device to "unstick". Both
nodes were using the anticipatory I/O scheduler. Changing to deadline didn't
get it going again, but I had intended to be running deadline anyway, so
it's good to know that I wasn't.
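
For reference, a minimal sketch of how the active scheduler can be checked, assuming the kernel exposes /sys/block/<device>/queue/scheduler and marks the scheduler in use with square brackets. The function name and the sample string are illustrative, not output captured from these nodes:

```python
import re

def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None

# Illustrative sample string (hypothetical, not from the machines discussed).
print(active_scheduler("noop [anticipatory] deadline cfq"))
```

On a live system the argument would be the contents of the scheduler file for the backing disk; the device name depends on the machine.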
Here are the relevant entries from /proc/drbd:
Secondary:
3: cs:ServerForDLess st:Secondary/Primary ld:Consistent
ns:772036 nr:18499220 dw:18499220 dr:772036 al:0 bm:465 lo:0 pe:0 ua:0 ap:0
Primary:
3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0
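
To compare the two nodes field by field, the entries above can be split into key/value pairs. This is only a sketch that assumes the two-line, 0.7-style layout shown here (a "minor: cs:... st:... ld:..." line followed by a counter line); the function name is hypothetical and other drbd versions may format /proc/drbd differently:

```python
import re

def parse_drbd_status(text):
    """Parse 0.7-style /proc/drbd device entries into dicts keyed by minor number."""
    devices = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"^(\d+):\s+(.*)$", line)
        if header:
            # New device entry, e.g. "3: cs:... st:... ld:..."
            current = devices.setdefault(int(header.group(1)), {})
            rest = header.group(2)
        elif current is not None and line:
            # Continuation line holding the counters for the current device
            rest = line
        else:
            continue
        for key, value in re.findall(r"(\w+):(\S+)", rest):
            current[key] = int(value) if value.isdigit() else value
    return devices

# The primary's entry, copied from the message above.
primary = """3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0"""

status = parse_drbd_status(primary)
print(status[3]["cs"], status[3]["lo"])
```

Parsed this way, the one non-zero counter among lo/pe/ua/ap is lo:2 on the primary, while the secondary shows zero for all four.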
Thanks,
Brent
On Tue, 25 Jul 2006, Brent A Nelson wrote:
> Some additional info: the mkfs is still hung and a subsequent attempt also
> hung. A short dd to the device did not hang, but it completed far too
> quickly and showed no activity on the secondary. A longer dd did hang.
>
> The machine has three stuck processes and top shows that the machine is in
> 100% wait.
>
> All 6 drbd devices have LVM logical volumes for their backing store (I used
> logical volumes so that the block devices wouldn't get reordered by the
> system if a disk disappeared; perhaps there's a better way). 3 disks are
> secondary for the other machine, and 3 disks are primary.
>
> Could this be an issue with drbd on LVM? Or maybe something that's fixed by a
> newer drbd version? A bug when compiled with gcc-3.4, maybe? Is there
> anything I should try to help diagnose the situation before I attempt to
> recover (these machines are not yet in production, so I can wait a bit, if
> needed)?
>
> Thanks,
>
> Brent
>
> On Mon, 24 Jul 2006, Brent A Nelson wrote:
>
>> I experienced a disk failure today when doing mkfs on one of 6 drbd
>> devices, which resulted in the process getting stuck in the "D" state.
>>
>> dmesg shows a series of SCSI errors and then the following on the primary:
>>
>> drbd3: drbd_md_sync_page_io(,390455306,WRITE) failed!
>> drbd3: Notified peer that my disk is broken.
>>
>> The secondary went to the "ServerForDLess" state and the primary went to
>> "DiskLessClient".
>>
>> This all seems like a normal drbd response, right? But, although I think I
>> can read from the device (read attempts don't report any errors, and the
>> secondary drbd processes seem to be busy serving data when I attempt a
>> read), I can't seem to write to it. I imagine if I switch the secondary
>> over to primary all will be well, but the primary should be able to pass
>> both reads and writes to the secondary in the event of its own disk
>> failing, correct?
>>
>> Is there something I'm doing wrong or a bug in my drbd (version 0.7.15 in
>> Ubuntu Dapper but running a 2.6.12 kernel)?
>>
>> Thanks,
>>
>> Brent Nelson
>> Director of Computing
>> Dept. of Physics
>> University of Florida
>>
>