[DRBD-user] drbd network block device deadlock ?

Tue Feb 1 18:35:23 CET 2011

>> Ivan Frain wrote:
>>> http://lwn.net/Articles/195416/

Did you see the date on that?  It's from 2006.  Is there a more recent cite
anywhere?  

> Antonio Anselmi wrote:
>> OS disk is usually a local device and not managed by drbd, i.e. is
>> not a backing device. Unless I'm missing something

Ivan Frain <ivan.frain at euranova.eu> wrote:
> I think this problem can also happen for non-OS disks. Considering
> that we access a non-OS disk, we can have pages in memory that belong
> to this disk. In that case, the system may ask to flush these "dirty"
> pages to the non-OS disk. And here we are: potential deadlock as I
> explain in my first post.

FWIW, we once had a bad query eat all the RAM on a DB box running DRBD.
Ivan's deadlock did *not* occur; what happened was that the OOM-killer
started up and rampaged through the process list before it finally killed
the DB server.  This caused its own problems, of course, but DRBD itself was
fine.

> I just want to have the conviction that this cannot happen in
> production.

Anything can happen in production.  However, based on my experiences with a
bunch of DRBD clusters over the last few years, the potential deadlock Ivan
brings up is far less likely than many other problems.  "Network switches
dying horribly" has caused more problems in our production environment than
any box running out of RAM has ever caused.

However, we're doing relatively standard Apache + PHP + MySQL things here.
Adjust for your environment if you're running really RAM-intensive
workloads, seeing the OOM-killer fire relatively frequently, or anything
like that.  YMMV after all.
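
If you want to keep an eye on how close a box is getting to the situation
Ivan describes, here is a rough Python sketch that reads the standard
/proc/meminfo fields MemTotal, MemFree, Dirty and Writeback.  The function
names are made up for illustration, and this only reports numbers; it does
not detect or prevent a deadlock.  MemFree ignores reclaimable page cache, so
the free fraction is pessimistic.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are
    plain counts with no unit.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def memory_pressure(info):
    """Summarize free memory and pages still waiting to reach disk.

    'info' is the dict returned by parse_meminfo().  Dirty + Writeback
    is the amount of data that must still be flushed to a block device.
    """
    total = info["MemTotal"]
    pending = info.get("Dirty", 0) + info.get("Writeback", 0)
    return {
        "free_fraction": info["MemFree"] / total,
        "pending_writeback_kb": pending,
    }
```

On a live Linux box you would feed it open("/proc/meminfo").read() and log
or graph the result from cron or your monitoring system of choice.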

-- 
Matt G / Dances With Crows
The Crow202 Blog:  http://crow202.org/wordpress/
There is no Darkness in Eternity/But only Light too dim for us to see