[DRBD-user] Kernel hung on DRBD / MD RAID

Mon Mar 5 23:58:17 CET 2012

On Mon, Mar 5, 2012 at 11:45 PM, Andreas Bauer <ab at voltage.de> wrote:
> I can share an observation:
>
> (Disclaimer: my knowledge of the Linux I/O stack is very limited)
>
> Kernel 3.1.0, DRBD 8.3.11, DRBD->LVM->MD-RAID1->SATA DISKS
> (Disks use CFQ scheduler)
>
> issue command: drbdadm verify all
> (with combined sync rate set to exceed disk performance)
>
> The system will become totally unresponsive, up to the point that all processes will wait longer than 120s to complete any I/O. In fact, their I/O does not get through until the I/O load from the DRBD verify drops because some volumes have completed their run.

Sorry, but this is a bit like:

"Doctor, I poked a rusty knife into my eye..."
"Yes?"
"... and now I have a problem."
"Well you already said that."

If you're telling your system to use a sync/verify rate that you
_know_ to be higher than what the disk can handle, then kicking off a
verify (drbdadm verify) or full sync (drbdadm invalidate-remote) will
badly beat up your I/O stack.

The documentation tells you to use a sync rate that doesn't exceed
about one third of your available bandwidth. You can also use
variable-rate synchronization which should take care of properly
throttling the syncer rate for you. But by deliberately setting a sync
rate that exceeds disk bandwidth, you're begging for trouble. Why
would you want to do this?
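
To make the one-third rule of thumb concrete, here is a rough sketch of
the arithmetic. The throughput figures and the helper name are made up
for illustration; measure your own hardware. The printed line assumes
the DRBD 8.3 style "syncer" section with a fixed rate.

```python
def suggested_sync_rate(bandwidth_mb_s, divisor=3):
    """Return a fixed sync rate in MB/s: about one third of the bandwidth.

    bandwidth_mb_s is the throughput of the slowest component in the
    replication path (disk or network), as a whole number of MB/s.
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return bandwidth_mb_s // divisor


# Hypothetical measurements, not taken from the system discussed here.
disk_mb_s = 90      # assumed sustained disk throughput
network_mb_s = 110  # assumed replication link throughput

# The slower of the two is the available bandwidth.
rate = suggested_sync_rate(min(disk_mb_s, network_mb_s))

# Assumed target: a DRBD 8.3 style fixed-rate setting.
print(f"syncer {{ rate {rate}M; }}")
```

The point is only that the configured rate should sit well below what
the slowest component can sustain, so that application I/O still gets
a share.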

The CFQ I/O scheduler is a bad choice for servers too, but that's
probably the lesser of your concerns right now.
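
If you want to check which scheduler each disk is currently using,
something along these lines should do. It is a sketch that assumes the
usual sysfs layout, where /sys/block/<disk>/queue/scheduler lists the
available schedulers and marks the active one in square brackets; the
function names are my own.

```python
from pathlib import Path


def active_scheduler(line):
    """Return the scheduler marked with [brackets], or None if none is."""
    for name in line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return None


def schedulers_by_disk(sys_block=Path("/sys/block")):
    """Map each block device name to its active I/O scheduler."""
    result = {}
    for path in sorted(sys_block.glob("*/queue/scheduler")):
        disk = path.parent.parent.name
        result[disk] = active_scheduler(path.read_text())
    return result


if __name__ == "__main__":
    for disk, sched in schedulers_by_disk().items():
        print(f"{disk}: {sched}")
```

Devices without a selectable scheduler show up as None.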

Cheers,
Florian
