[DRBD-user] Kernel hung on DRBD / MD RAID
ab at voltage.de
Tue Mar 6 00:27:33 CET 2012
From: Florian Haas <florian at hastexo.com>
Sent: Mon 05-03-2012 23:59
> On Mon, Mar 5, 2012 at 11:45 PM, Andreas Bauer <ab at voltage.de> wrote:
> > I can share an observation:
> > (Disclaimer: my knowledge of the Linux I/O stack is very limited)
> > Kernel 3.1.0, DRBD 8.3.11, DRBD->LVM->MD-RAID1->SATA DISKS
> > (Disks use CFQ scheduler)
> > issue command: drbdadm verify all
> > (with combined sync rate set to exceed disk performance)
> > The system will become totally unresponsive, up to the point that all
> > processes will wait longer than 120s to complete any I/O. In fact their I/O
> > does not get through until the I/O load from the DRBD verify drops because
> > some volumes have completed their run.
> Sorry but this is a bit like:
> "Doctor, I poked a rusty knife into my eye..."
> "... and now I have a problem."
> "Well you already said that."
Nice one. :-)
> If you're telling your system to use an sync/verify rate that you
> _know_ to be higher than what the disk can handle, then kicking off a
> verify (drbdadm verify) or full sync (drbdadm invalidate-remote) will
> badly beat up your I/O stack.
> The documentation tells you to use a sync rate that doesn't exceed
> about one third of your available bandwidth. You can also use
> variable-rate synchronization which should take care of properly
> throttling the syncer rate for you. But by deliberately setting a sync
> rate that exceeds disk bandwidth, you're begging for trouble. Why
> would you want to do this?
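For reference, a minimal syncer configuration along the lines Florian describes might look like the sketch below. This is a config fragment, not a tested setup: the resource name, the fixed rate (chosen as roughly one third of an assumed ~100 MB/s disk bandwidth), and the variable-rate values are all illustrative assumptions; the c-* options require DRBD 8.3.9 or later.

```
resource r0 {
  syncer {
    rate 33M;            # fixed rate: ~1/3 of assumed ~100 MB/s disk bandwidth

    # Alternatively, variable-rate synchronization (DRBD 8.3.9+),
    # which throttles the resync dynamically:
    # c-plan-ahead 20;    # enable the dynamic sync-rate controller (0.1s units)
    # c-fill-target 100k; # target amount of in-flight resync data
    # c-min-rate 4M;      # floor when application I/O is competing
    # c-max-rate 100M;    # ceiling, should not exceed disk/link bandwidth
  }
  # ... remainder of the resource definition ...
}
```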
Because I want to badly beat up my I/O stack? The point of this exercise is to reproduce the kernel hang. So, to stay with the image: the stack should be able to take a beating without dying in the process.
My point was that with a DRBD verify I can produce a total I/O lockup, which I was otherwise unable to produce with the CFQ scheduler.
I run my configuration with fixed sync rates at the moment, but even then the same situation can occur when disk performance drops dramatically, for example while an MD RAID resync is running on the underlying disks.
> The CFQ I/O scheduler is a bad choice for servers too, but that's
> probably the lesser of your concerns right now.
I am aware of the general recommendation not to use CFQ on servers, but I picked up the information that for a software RAID 1 on plain SATA rotating disks, CFQ yields slightly better performance than the other schedulers. This would not apply to hardware RAID.
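For anyone who wants to experiment with the scheduler choice: the active I/O scheduler can be inspected and switched at runtime through the standard sysfs interface of the Linux block layer. This is a kernel-configuration fragment rather than a script; the device name sda is just an example, and the set of listed schedulers depends on the kernel build.

```
# Show the schedulers compiled into the kernel for a device;
# the currently active one appears in brackets, e.g. "noop deadline [cfq]".
cat /sys/block/sda/queue/scheduler

# Switch to the deadline scheduler at runtime (requires root):
echo deadline > /sys/block/sda/queue/scheduler
```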