On Mon, Mar 11, 2013 at 5:52 AM, Lars Ellenberg <lars.ellenberg at linbit.com> wrote:
> On Mon, Mar 04, 2013 at 04:44:37PM -0500, Jesus Climent wrote:
>> Any luck with this?
>
> Not enough context to be able to debug this.
> Stack traces look normal,
> and even the "(stalled)" thingy in /proc/drbd
> does not need to be a cause of concern on a busy server.

The problem with the stalled connection is that it really stays like
that, even if the server stops being busy. I have managed to reproduce
this case and the only way to get out of it that I have managed is
bringing down the replication interface. Up until that point the upper
layer of cluster management (ganeti) believes the migration is in
progress and does not allow the nodes to perform any other action (due
to locking).

As I said, I managed to break the lock by bringing down and up the
network interface, and even on a non-busy server, restarting the sync
by bringing down the secondary and restarting the sync process,
*sometimes* the sync process gets again in a stalled situation.

> Maybe it helps if you correlate the lower level device IO queues,
> and the network socket buffers as well.

How can I do that?

-- climent () gmail ! com
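
One possible way to do the correlation asked about above is to sample both sides at the same instant and print them next to each other. The following is only a rough sketch, not something taken from this thread: it reads the "I/Os currently in progress" column of /proc/diskstats and the tx_queue/rx_queue columns of /proc/net/tcp (IPv4 only). The device names (sda, drbd0) and the port (7788) are placeholders and have to be replaced with the backing device, the DRBD device and the replication port actually in use. The function names are made up for this illustration.

```python
import time


def inflight_io(diskstats_text, devices):
    """Return {device: I/Os currently in progress} from /proc/diskstats text."""
    result = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the statistics;
        # the 9th statistic (index 11) is "I/Os currently in progress".
        if len(fields) >= 12 and fields[2] in devices:
            result[fields[2]] = int(fields[11])
    return result


def socket_queues(tcp_text, port):
    """Return [(local, remote, tx_bytes, rx_bytes)] for TCP sockets using port."""
    rows = []
    for line in tcp_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        local, remote = fields[1], fields[2]
        # Addresses look like HEXIP:HEXPORT; match on either end of the socket.
        ports = {int(addr.rsplit(":", 1)[1], 16) for addr in (local, remote)}
        if port in ports:
            tx_hex, rx_hex = fields[4].split(":")  # tx_queue:rx_queue, in hex
            rows.append((local, remote, int(tx_hex, 16), int(rx_hex, 16)))
    return rows


def sample(devices, port, interval=1.0, count=10):
    """Print in-flight block I/O and socket queue sizes side by side."""
    for _ in range(count):
        with open("/proc/diskstats") as f:
            io = inflight_io(f.read(), devices)
        with open("/proc/net/tcp") as f:
            queues = socket_queues(f.read(), port)
        print(time.strftime("%H:%M:%S"), io, queues)
        time.sleep(interval)
```

Calling `sample({"sda", "drbd0"}, 7788)` on both nodes while the resync is stalled should show whether requests are stuck in the block layer (in-flight count stays above zero) or in the network (tx_queue or rx_queue stays non-empty while nothing moves). If replication runs over IPv6, /proc/net/tcp6 would have to be read instead.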