[DRBD-user] Kernel hung on DRBD / MD RAID

Mon Mar 5 23:45:16 CET 2012

From:	Micha Kersloot <micha at kovoks.nl>
Sent:	Mon 05-03-2012 16:36

> > No I haven't gotten any further and do not have equipment for testing
> > at the moment, but put my effort into carefully scheduling all
> > cron-jobs / backups so that there are no RAID resync/verify runs
> > when heavy I/O can be expected. So far no recurrence. By default, the MD
> > resync runs on the first Sunday of each month, around 1am on Debian.
> 
> We've even disabled RAID1 on our production servers and fully rely on drbd 
> because we are unable to schedule the load. We hoped converting to KVM could 
> solve the problem because there is no separate hypervisor, but you say that
> is not the solution.
> 
> We have used all the different components (xen, drbd, raid1) for several
> years now in different combinations, but only this specific combination of
> components seems to have problems. I have no idea where to start debugging,
> or even where to start asking for help. Do you have an idea?

I can share an observation:

(Disclaimer: my knowledge of the Linux I/O stack is very limited)

Kernel 3.1.0, DRBD 8.3.11, DRBD->LVM->MD-RAID1->SATA DISKS
(Disks use CFQ scheduler)

issue command: drbdadm verify all
(with combined sync rate set to exceed disk performance)

The system becomes totally unresponsive, to the point that all processes wait longer than 120 s to complete any I/O. In fact, their I/O does not get through until the load from the DRBD verify drops because some volumes have completed their run.

With the CFQ scheduler on the disks, I would expect non-DRBD I/O to be slow, obviously, but still to complete in less than 120 seconds. That is not the case.
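
For anyone comparing setups, here is a minimal sketch for checking which scheduler is really active on a disk. It assumes the usual sysfs file /sys/block/<dev>/queue/scheduler, where the active entry is shown in square brackets (for example "noop deadline [cfq]"). The function names are my own and only for illustration.

```python
from pathlib import Path


def active_scheduler(text):
    """Return the bracketed (active) entry from a queue/scheduler line."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    # No bracketed entry, e.g. a device that only reports "none"
    return None


def scheduler_for(device):
    """Read the active scheduler for a device name such as 'sda' (hypothetical)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())
```

Calling scheduler_for("sda") on each member disk of the MD array should show whether CFQ is in effect where you think it is.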

If I had a spare test machine, one test I would try is to run an MD resync, a DRBD verify and some other I/O at the same time and see whether I can reliably trigger the problem.
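
The "some other I/O" part of such a test could be a small probe like the sketch below. It is untested on the setup described here. It repeatedly writes 4 KiB to a file, calls fsync, and records how long each round takes, counting rounds that exceed a threshold. The path, round count, interval and the 120 s default threshold are assumptions chosen to match the numbers above; the resync and verify would have to be started separately.

```python
import os
import time


def timed_sync_write(path, payload=b"x" * 4096):
    """Write payload to path, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # block until the data has been pushed to the device
    finally:
        os.close(fd)
    return time.monotonic() - start


def probe(path, rounds, interval=1.0, threshold=120.0):
    """Run timed writes; return (worst latency, number of rounds above threshold)."""
    worst, stalls = 0.0, 0
    for i in range(rounds):
        elapsed = timed_sync_write(path)
        worst = max(worst, elapsed)
        if elapsed > threshold:
            stalls += 1
        # Timestamped line per round, so stalls can be matched against the sync load
        print(f"{time.strftime('%H:%M:%S')} write {i}: {elapsed:.3f}s", flush=True)
        time.sleep(interval)
    return worst, stalls
```

Pointing the probe at a file on a filesystem that shares the physical disks but does not go through DRBD would show whether non-DRBD I/O really stalls for that long while the verify is running.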

HTH,

Andreas