[DRBD-user] Re: DRBD with disk failure.

Wed Jul 26 02:04:29 CEST 2006

I just noticed something else.  It appears that I can't mkfs a different 
drbd device on this node, either.  It hangs, just like the device with 
the faulty disk.

The drbd devices for which the node is secondary still seem to be 
responding, however (at least I see nothing which suggests otherwise).

Thanks,

Brent

On Tue, 25 Jul 2006, Brent A Nelson wrote:

> Lars, none of your suggestions caused the drbd device to "unstick".  Both 
> nodes were using anticipatory io-scheduling (changing to deadline didn't get 
> it going again, although I wanted to be running deadline, anyway, so it's 
> good to know that I wasn't).
>
> Here are the relevant entries from /proc/drbd:
> Secondary:
> 3: cs:ServerForDLess st:Secondary/Primary ld:Consistent
>    ns:772036 nr:18499220 dw:18499220 dr:772036 al:0 bm:465 lo:0 pe:0 ua:0 ap:0
>
> Primary:
> 3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
>    ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0
>
> Thanks,
>
> Brent
>
> On Tue, 25 Jul 2006, Brent A Nelson wrote:
>
>> Some additional info: the mkfs is still hung and a subsequent attempt also 
>> hung.  A short dd to the device did not hang, but it completed far too 
>> quickly and showed no activity on the secondary.  A longer dd did hang.
>> 
>> The machine has three stuck processes and top shows that the machine is in 
>> 100% wait.
>> 
>> All 6 drbd devices have LVM logical volumes for their backing store (I used 
>> logical volumes so that the block devices wouldn't get reordered by the 
>> system if a disk disappeared; perhaps there's a better way).  3 disks are 
>> secondary for the other machine, and 3 disks are primary.
>> 
>> Could this be an issue with drbd on LVM? Or maybe something that's fixed by 
>> a newer drbd version? A bug when compiled with gcc-3.4, maybe? Is there 
>> anything I should try to help diagnose the situation before I attempt to 
>> recover (these machines are not yet in production, so I can wait a bit, if 
>> needed)?
>> 
>> Thanks,
>> 
>> Brent
>> 
>> On Mon, 24 Jul 2006, Brent A Nelson wrote:
>> 
>>> I experienced a disk failure today when doing mkfs on one of 6 drbd 
>>> devices, which resulted in the process getting stuck in the "D" state.
>>> 
>>> dmesg shows a series of SCSI errors and then the following on the primary:
>>> 
>>> drbd3: drbd_md_sync_page_io(,390455306,WRITE) failed!
>>> drbd3: Notified peer that my disk is broken.
>>> 
>>> The secondary went to the "ServerForDLess" state and the primary went to 
>>> "DiskLessClient".
>>> 
>>> This all seems like a normal drbd response, right? But, although I think I 
>>> can read from the device (read attempts don't report any errors, and the 
>>> secondary drbd processes seem to be busy serving data when I attempt a 
>>> read), I can't seem to write to it.  I imagine if I switch the secondary 
>>> over to primary all will be well, but the primary should be able to pass 
>>> both reads and writes to the secondary in the event of its own disk 
>>> failing, correct?
>>> 
>>> Is there something I'm doing wrong or a bug in my drbd (version 0.7.15 in 
>>> Ubuntu Dapper but running a 2.6.12 kernel)?
>>> 
>>> Thanks,
>>> 
>>> Brent Nelson
>>> Director of Computing
>>> Dept. of Physics
>>> University of Florida
>>> 
>> 
>
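
For comparing the two nodes, the /proc/drbd entries quoted above can be split into fields with a short script. This is only a rough sketch: it assumes the 0.7-style layout shown in this thread (a "N: cs:... st:... ld:..." line followed by a line of counters), the function name parse_drbd_status is made up for illustration, and the sample text is the primary's entry copied from the message above. Treating nonzero lo/pe/ua/ap as "requests still outstanding" is my reading of those counters, not something confirmed in this thread.

```python
import re

# Primary's entry for device 3, copied from the thread (counters on one line).
SAMPLE_PRIMARY = """\
3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
   ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0
"""


def parse_drbd_status(text):
    """Return {minor: {field: value}} for 0.7-style /proc/drbd text.

    Hypothetical helper: header lines that appear before the first
    "N:" device line are skipped.
    """
    devices = {}
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+(.*)", line)
        if m:
            # New device entry, e.g. "3: cs:... st:... ld:..."
            current = {}
            devices[int(m.group(1))] = current
            rest = m.group(2)
        elif current is not None:
            # Continuation line holding the counters
            rest = line
        else:
            continue
        for key, value in re.findall(r"(\w+):(\S+)", rest):
            current[key] = int(value) if value.isdigit() else value
    return devices


if __name__ == "__main__":
    for minor, fields in parse_drbd_status(SAMPLE_PRIMARY).items():
        # Assumption: nonzero lo/pe/ua/ap means requests are still outstanding.
        busy = {k: fields[k] for k in ("lo", "pe", "ua", "ap") if fields.get(k)}
        print(minor, fields["cs"], fields["st"], fields["ld"], busy)
```

On a live node the same function could be fed the contents of /proc/drbd instead of the sample string, which makes it easier to watch whether the counters change while the mkfs is hung.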