On Mon, Mar 11, 2013 at 07:23:45AM -0400, Jesus Climent wrote:
> On Mon, Mar 11, 2013 at 5:52 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> > On Mon, Mar 04, 2013 at 04:44:37PM -0500, Jesus Climent wrote:
> >> Any luck with this?
> >
> > Not enough context to be able to debug this.
> > Stack traces look normal,
> > and even the "(stalled)" thingy in /proc/drbd
> > does not need to be a cause of concern on a busy server.
>
> The problem with the stalled connection is that it really stays like
> that, even if the server stops being busy. I have managed to reproduce
> this case and the only way to get out of it that i have managed is
> bringing down the replication interface. Up until that point the upper
> layer of cluster management (ganeti) believes the migration is in
> progress and does not allow the nodes to perform any other action (due
> to locking).
>
> As I said, I managed to break the lock by bringing down and up the
> network interface, and even on a non-busy server, restarting the sync
> by bringing down the secondary and restarting the sync process,
> *sometimes* the sync process gets again in a stalled situation.
>
> > Maybe it helps if you correlate the lower level device IO queues,
> > and the network socket buffers as well.
>
> How can I do that?

netstat or ss would be my preferred way for the network sockets,
/proc/diskstats or iostat or similar for the io stack.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
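
One way to do the correlation suggested above is to sample both sources at the same instant and print them side by side. The Python sketch below is only an illustration, not anything from the thread: it reads the "I/Os currently in progress" counter from /proc/diskstats and the tx_queue/rx_queue columns from /proc/net/tcp (the same numbers netstat and ss report as Send-Q/Recv-Q). The device names and the port 7788 are assumptions; substitute the backing device, the drbd device, and the replication port from your own resource configuration.

```python
import time


def inflight_io(diskstats_text, devices):
    """Return {device: I/Os currently in progress} from /proc/diskstats text."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the per-device counters;
        # the 9th counter (index 11) is "I/Os currently in progress".
        if len(fields) >= 12 and fields[2] in devices:
            result[fields[2]] = int(fields[11])
    return result


def socket_queues(proc_net_tcp_text, port):
    """Return [(local_port, remote_port, tx_queue, rx_queue)] for sockets on port."""
    rows = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 5:
            continue
        # Addresses look like HEXADDR:HEXPORT; queue sizes are hex bytes.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        if port in (local_port, remote_port):
            tx, rx = (int(v, 16) for v in fields[4].split(":"))
            rows.append((local_port, remote_port, tx, rx))
    return rows


def read_text(path):
    with open(path) as f:
        return f.read()


def monitor(devices, port, interval=1.0, count=10):
    """Print in-flight disk I/O and socket queue sizes once per interval."""
    for _ in range(count):
        disks = inflight_io(read_text("/proc/diskstats"), devices)
        socks = socket_queues(read_text("/proc/net/tcp"), port)
        print(time.strftime("%H:%M:%S"), disks, socks)
        time.sleep(interval)
```

For example, `monitor({"sda", "drbd0"}, 7788)` on both nodes while the resync is stalled would show whether requests are sitting in the lower-level device queue or whether data is piling up unsent or unread in the replication socket. For IPv6 replication links the same parsing applies to /proc/net/tcp6.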