[DRBD-user] Case of stalled connection.

Mon Mar 11 12:23:45 CET 2013

On Mon, Mar 11, 2013 at 5:52 AM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Mon, Mar 04, 2013 at 04:44:37PM -0500, Jesus Climent wrote:
>> Any luck with this?
>
> Not enough context to be able to debug this.
> Stack traces look normal,
> and even the "(stalled)" thingy in /proc/drbd
> does not need to be a cause of concern on a busy server.

The problem with the stalled connection is that it really stays like
that, even after the server stops being busy. I have managed to
reproduce this case, and the only way out of it that I have found is
to bring down the replication interface. Up until that point the upper
layer of cluster management (Ganeti) believes the migration is in
progress and does not allow the nodes to perform any other action (due
to locking).

As I said, I managed to break the lock by bringing the network
interface down and up again. Even on a non-busy server, after
restarting the sync by bringing down the secondary, the sync process
*sometimes* ends up in a stalled state again.

> Maybe it helps if you correlate the lower level device IO queues,
> and the network socket buffers as well.

How can I do that?
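
For what it is worth, my best guess at what such a correlation could
look like is the rough sketch below. It is an assumption on my side,
not something Lars described: it reads the "I/Os currently in progress"
counter from /proc/diskstats and the tx/rx queue sizes from
/proc/net/tcp, and puts them next to the raw /proc/drbd text. The
function names, the device name "sda" and the port 7788 are only
placeholders; the real backing device and port have to be taken from
the DRBD resource configuration.

```python
import time


def parse_inflight(diskstats_text):
    """Map device name -> number of I/Os currently in flight."""
    inflight = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 12:
            # Layout: major, minor, name, then the per-device counters.
            # The 9th counter (index 11) is "I/Os currently in progress".
            inflight[fields[2]] = int(fields[11])
    return inflight


def parse_tcp_queues(proc_net_tcp_text, port):
    """Return (local, remote, tx_bytes, rx_bytes) for sockets using `port`."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote = fields[1], fields[2]
        # Addresses look like HEXIP:HEXPORT; the port is the last part.
        ports = {int(addr.rsplit(":", 1)[1], 16) for addr in (local, remote)}
        if port in ports:
            # Column 5 is tx_queue:rx_queue, both in hexadecimal.
            tx, rx = (int(x, 16) for x in fields[4].split(":"))
            rows.append((local, remote, tx, rx))
    return rows


def read_text(path):
    with open(path) as handle:
        return handle.read()


def snapshot(port=7788, backing_dev="sda"):
    """Take one sample of DRBD state, disk queue and socket queues.

    `port` and `backing_dev` are placeholders, not values from this thread.
    """
    return {
        "time": time.time(),
        "drbd": read_text("/proc/drbd"),
        "inflight": parse_inflight(read_text("/proc/diskstats")).get(backing_dev),
        "sockets": parse_tcp_queues(read_text("/proc/net/tcp"), port),
    }
```

The idea would be to call snapshot() about once per second on both
nodes while the sync is stalled and compare the samples: a tx queue
that stays full while the peer's disk has nothing in flight would point
towards the network side, whereas a constantly non-zero in-flight count
would point towards the lower level device.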

-- 
climent () gmail ! com