[DRBD-user] Kernel panic in skb_copy_bits

Wed Mar 11 10:25:50 CET 2009

On Tue, Mar 10, 2009 at 09:54:43AM -0400, Graham, Simon wrote:
> I would guess that you have TSO and scatter/gather enabled in your Dom0
> -- this crash is an unfortunate confluence of a number of behaviours
> that is only exposed when you are using Xen _and_ have scatter/gather
> enabled _and_ lose the link to the peer.
> 
> The problem is that when you have scatter/gather enabled, DRBD uses the
> zero-copy interface to the network stack -- the network stack has the
> annoying tendency to keep a reference to the pages of requests even
> after the TCP connection they were sent on is terminated _but_ DRBD
> makes the assumption that once the TCP connection to the peer is
> terminated, the pages of all pending requests are no longer referenced
> by the network stack and it can safely complete any pending I/O requests
> upwards.
> 
> Normally this is fine too, _but_ when you are running Xen and the
> request comes from blkback you will have a problem because when the I/O
> is completed to blkback, it unmaps the grant reference it has on the
> guest's pages -- this effectively removes the pages from Dom0 -- and
> later on, if the network stack decides to use the cached page it has,
> you blow up with this oops.
> 
> So... there are several problem areas here that conspire:
> 
> 1. DRBD is violating the block I/O contract -- when it completes an I/O,
>    no one underneath it should have a reference to any of the pages in
>    the request. This is only a problem when zero-copy network send is
>    used.
> 2. blkback is freeing (unmapping) pages while they are still referenced
>    by lower layers -- really it should not do this until all the
>    underlying references are gone (but it is assuming the lower layer
>    followed the block I/O contract).
> 
> I think you have to disable scatter/gather support in the network layer
> until a proper fix can be determined...

Thanks for the explanation. Indeed, the kernel panic goes away if I disable
s/g on the interface used by DRBD and restart all the connections:

  ethtool -K xenbr0 sg off
  drbdadm disconnect drbd*
  drbdadm connect drbd*
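
To double-check that the setting actually took effect, something like the
following could be used. This is only a minimal sketch: it assumes that
"ethtool -k" prints one "feature: state" line per offload (for example
"scatter-gather: on"), and offload_state is a hypothetical helper name of
my own, not part of ethtool or DRBD.

```shell
# Hypothetical helper: print the state ("on" or "off") of one offload
# feature, reading the output of `ethtool -k <iface>` on stdin.
# Assumes lines of the form "scatter-gather: on"; a trailing marker such
# as "[fixed]" is dropped, and indented sub-features are not matched.
offload_state() {
  awk -v feat="$1" -F': *' '
    $1 == feat { split($2, w, " "); print w[1]; exit }
  '
}
```

Usage would then be along these lines, and should print "off" once s/g
has been disabled:

  ethtool -k xenbr0 | offload_state scatter-gather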

Any chance this could be fixed in DRBD, for example by explicitly cleaning
up the connection before returning I/O complete to Xen?

-- 
Valentin