On Tue, Mar 10, 2009 at 09:54:43AM -0400, Graham, Simon wrote:
> I would guess that you have TSO and scatter/gather enabled in your Dom0
> -- this crash is an unfortunate confluence of a number of behaviours
> that is only exposed when you are using Xen _and_ have scatter/gather
> enabled _and_ lose the link to the peer.
>
> The problem is that when you have scatter/gather enabled, DRBD uses the
> zero-copy interface to the network stack -- the network stack has the
> annoying tendency to keep a reference to the pages of requests even
> after the TCP connection they were sent on is terminated _but_ DRBD
> makes the assumption that once the TCP connection to the peer is
> terminated, the pages of all pending requests are no longer referenced
> by the network stack and it can safely complete any pending I/O requests
> upwards.
>
> Normally this is fine too, _but_ when you are running Xen and the
> request comes from blkback you will have a problem because when the I/O
> is completed to blkback, it unmaps the grant reference it has on the
> guests pages -- this effectively removes the page from Dom0 -- later on,
> if the network stack decides to use the cached page it has and you blow
> up with this oops.
>
> So.. there are several problem areas here that conspire:
>
> 1. DRBD is violating the block I/O contract -- when it completes an I/O,
>    NOONE underneath it should have a reference to any of the pages in
>    the request - this is only a problem when zero-copy network send is
>    used.
> 2. blkback is freeing (unmapping) pages when they are still referenced
>    by lower layers -- really it should not do this until all the
>    underlying references are gone (but it's assuming the lower layer
>    followed the block I/O contract)
>
> I think you have to disable scatter/gather support in the network layer
> until a proper fix can be determined...

Thanks for the explanation.
Indeed, the kernel panic goes away if I disable s/g on the interface used
by DRBD and restart all the connections:

    ethtool -K xenbr0 sg off
    drbdadm disconnect drbd*
    drbdadm connect drbd*

Any chance this could be fixed in DRBD, for example by explicitly
cleaning up the connection before returning I/O complete to Xen?

--
Valentin
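
For anyone applying the same workaround, here is a rough sketch that only touches the interface if scatter/gather is actually on. It is not a tested procedure. The helper names sg_state and apply_workaround are made up for illustration, and it uses the drbdadm keyword "all" in place of the drbd* pattern from the commands in the message, on the assumption that every configured resource should be restarted.

```shell
#!/bin/sh
# Sketch only: sg_state and apply_workaround are hypothetical helper names.

# Print the state ("on" or "off") of the top-level scatter-gather feature,
# reading the output of `ethtool -k <iface>` on stdin. Sub-feature lines
# such as "tx-scatter-gather:" do not match and are ignored.
sg_state() {
    awk '$1 == "scatter-gather:" { print $2; exit }'
}

# Turn scatter/gather off on one interface if it is currently on, then
# restart the DRBD connections as described in the message.
# Needs root. Assumption: "all" stands in for the drbd* resource names.
apply_workaround() {
    iface=$1
    if [ "$(ethtool -k "$iface" | sg_state)" = "on" ]; then
        ethtool -K "$iface" sg off
    fi
    drbdadm disconnect all
    drbdadm connect all
}
```

Usage, as root, would be something like `apply_workaround xenbr0`. Note that the setting made with ethtool -K is not persistent by itself, so it has to be reapplied after a reboot or when the interface is recreated.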