[DRBD-user] Re: DRBD with disk failure.

Wed Jul 26 01:12:42 CEST 2006

Lars, none of your suggestions caused the drbd device to "unstick".  Both 
nodes were using the anticipatory I/O scheduler.  Switching to deadline 
didn't get it going again, but I had intended to be running deadline 
anyway, so it's good to know that I wasn't.
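
For anyone wanting to check which scheduler is active: on 2.6 kernels the 
file /sys/block/<dev>/queue/scheduler lists the available schedulers with 
the active one in square brackets.  A minimal Python sketch follows; the 
function names and the device name "sda" are only illustrative assumptions, 
not anything taken from the setup described here.

```python
def active_scheduler(contents):
    """Return the bracketed (active) name from a queue/scheduler string."""
    for name in contents.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None  # no bracketed entry found


def read_scheduler(device):
    """Read the active scheduler for a block device such as 'sda' (hypothetical name)."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return active_scheduler(f.read())


# Example with a typical 2.6-era file content (not read from a real system):
example = active_scheduler("noop [anticipatory] deadline cfq")
```

Writing a scheduler name back into the same file (as root) is how the 
scheduler is switched at runtime.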

Here are the relevant entries from /proc/drbd:
Secondary:
  3: cs:ServerForDLess st:Secondary/Primary ld:Consistent
     ns:772036 nr:18499220 dw:18499220 dr:772036 al:0 bm:465 lo:0 pe:0 ua:0 ap:0

Primary:
  3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
     ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0
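
To compare the two nodes' entries field by field, the key:value pairs can 
be pulled apart mechanically.  The following is a rough Python sketch that 
assumes the 0.7-style two-line layout shown above; parse_drbd_status is a 
made-up helper name, and the sample text is simply the Primary entry 
pasted in.  As far as I understand the counters, "lo" is the number of 
open requests to the local I/O subsystem, so a nonzero value on a node 
that claims to be diskless may be worth a look.

```python
import re

SAMPLE = """\
  3: cs:DiskLessClient st:Primary/Secondary ld:Inconsistent
     ns:18499696 nr:0 dw:13933352 dr:4995314 al:5062 bm:626 lo:2 pe:0 ua:0 ap:0
"""


def parse_drbd_status(text):
    """Parse 0.7-style /proc/drbd device entries into {minor: {key: value}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current device entry
            continue
        head = re.match(r"\s*(\d+):\s+(.*)", line)
        if head:
            # First line of an entry: "<minor>: cs:... st:... ld:..."
            current = devices.setdefault(int(head.group(1)), {})
            rest = head.group(2)
        elif current is not None:
            # Continuation line holding the counters
            rest = line
        else:
            continue  # header or other text outside a device entry
        for key, value in re.findall(r"(\w+):(\S+)", rest):
            current[key] = int(value) if value.isdigit() else value
    return devices


devices = parse_drbd_status(SAMPLE)
# Counters that indicate outstanding work on device 3
outstanding = {k: devices[3][k] for k in ("lo", "pe", "ua", "ap") if devices[3][k]}
```

On a live node the same function could be fed open("/proc/drbd").read() 
instead of the pasted sample.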

Thanks,

Brent

On Tue, 25 Jul 2006, Brent A Nelson wrote:

> Some additional info: the mkfs is still hung and a subsequent attempt also 
> hung.  A short dd to the device did not hang, but it completed far too 
> quickly and showed no activity on the secondary.  A longer dd did hang.
>
> The machine has three stuck processes and top shows that the machine is in 
> 100% I/O wait.
>
> All 6 drbd devices have LVM logical volumes for their backing store (I used 
> logical volumes so that the block devices wouldn't get reordered by the 
> system if a disk disappeared; perhaps there's a better way).  3 disks are 
> secondary for the other machine, and 3 disks are primary.
>
> Could this be an issue with drbd on LVM? Or maybe something that's fixed by a 
> newer drbd version? A bug when compiled with gcc-3.4, maybe? Is there 
> anything I should try to help diagnose the situation before I attempt to 
> recover (these machines are not yet in production, so I can wait a bit, if 
> needed)?
>
> Thanks,
>
> Brent
>
> On Mon, 24 Jul 2006, Brent A Nelson wrote:
>
>> I experienced a disk failure today when doing mkfs on one of 6 drbd 
>> devices, which resulted in the process getting stuck in the "D" state.
>> 
>> dmesg shows a series of SCSI errors and then the following on the primary:
>> 
>> drbd3: drbd_md_sync_page_io(,390455306,WRITE) failed!
>> drbd3: Notified peer that my disk is broken.
>> 
>> The secondary went to the "ServerForDLess" state and the primary went to 
>> "DiskLessClient".
>> 
>> This all seems like a normal drbd response, right? But, although I think I 
>> can read from the device (read attempts don't report any errors, and the 
>> secondary drbd processes seem to be busy serving data when I attempt a 
>> read), I can't seem to write to it.  I imagine if I switch the secondary 
>> over to primary all will be well, but the primary should be able to pass 
>> both reads and writes to the secondary in the event of its own disk 
>> failing, correct?
>> 
>> Is there something I'm doing wrong or a bug in my drbd (version 0.7.15 in 
>> Ubuntu Dapper but running a 2.6.12 kernel)?
>> 
>> Thanks,
>> 
>> Brent Nelson
>> Director of Computing
>> Dept. of Physics
>> University of Florida
>> 
>