[DRBD-user] Case of stalled connection.

Mon Mar 11 14:32:18 CET 2013

On Mon, Mar 11, 2013 at 07:23:45AM -0400, Jesus Climent wrote:
> On Mon, Mar 11, 2013 at 5:52 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> > On Mon, Mar 04, 2013 at 04:44:37PM -0500, Jesus Climent wrote:
> >> Any luck with this?
> >
> > Not enough context to be able to debug this.
> > Stack traces look normal,
> > and even the "(stalled)" thingy in /proc/drbd
> > does not need to be a cause of concern on a busy server.
> 
> The problem with the stalled connection is that it really stays like
> that, even if the server stops being busy. I have managed to reproduce
> this case, and the only way out of it that I have found is
> bringing down the replication interface. Up until that point the upper
> layer of cluster management (ganeti) believes the migration is in
> progress and does not allow the nodes to perform any other action (due
> to locking).
> 
> As I said, I managed to break the lock by bringing the network
> interface down and up again. Even on a non-busy server, after
> restarting the sync by bringing down the secondary, the sync process
> *sometimes* ends up stalled again.
> 
> > Maybe it helps if you correlate the lower level device IO queues,
> > and the network socket buffers as well.
> 
> How can I do that?

netstat or ss would be my preferred way to look at the network sockets
(watch the Recv-Q and Send-Q columns); /proc/diskstats, iostat, or
similar for the I/O stack.
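
If you would rather take both readings at the same instant, something
like the following could do it. This is only a rough sketch, not a
DRBD tool: the function names are made up for illustration, the port
7788 is a placeholder for whatever port your resource actually uses,
and it reads /proc/net/tcp, so it only sees IPv4 sockets. It assumes
the usual Linux layouts, where the ninth statistics field of
/proc/diskstats is the number of I/Os currently in flight and the
tx_queue:rx_queue column of /proc/net/tcp holds the queued bytes in hex.

```python
def parse_diskstats(text):
    """Map device name -> number of I/Os currently in flight."""
    inflight = {}
    for line in text.splitlines():
        parts = line.split()
        # major, minor, name, then the statistics fields;
        # index 11 is "I/Os currently in progress".
        if len(parts) >= 12:
            inflight[parts[2]] = int(parts[11])
    return inflight


def parse_tcp_queues(text, port):
    """Return (local_port, remote_port, tx_bytes, rx_bytes) for every
    socket that has `port` on either end."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        parts = line.split()
        if len(parts) < 5:
            continue
        # Addresses look like HEXIP:HEXPORT.
        local_port = int(parts[1].rsplit(":", 1)[1], 16)
        remote_port = int(parts[2].rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            tx, rx = (int(v, 16) for v in parts[4].split(":"))
            rows.append((local_port, remote_port, tx, rx))
    return rows


def snapshot(drbd_port=7788,  # placeholder, use your resource's port
             diskstats_path="/proc/diskstats",
             tcp_path="/proc/net/tcp"):
    """Read both files once and return (in-flight I/Os, socket queues)."""
    with open(diskstats_path) as f:
        disks = parse_diskstats(f.read())
    with open(tcp_path) as f:
        socks = parse_tcp_queues(f.read(), drbd_port)
    return disks, socks
```

Calling snapshot() once a second and printing the result next to
/proc/drbd should show whether requests pile up on the backing device
(in-flight count stays above zero) or in the socket buffers (tx or rx
bytes stay non-zero) while the connection reports "(stalled)".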

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.