Note: "permalinks" may not be as permanent as we would like;
direct links to old sources may well be a few messages off.

> > you could try and tell drbd to no longer use zero copy send using sendpage,
> > but always do an actual data copy to the socket buffer, which should
> > avoid the described problem. easiest way to do so: use DRBD protocol A,
> > and see if these crashes still occur.
>
> Given that I seem to have a reproducible test case (see below), that
> should be easy enough to try.
>
> YUP, tried it. I have NOT exhaustively tested it, but given that
> typing sync without any significant I/O preceding it (and I'm in
> the guest in single user mode), is enough to crash dom0 (proto C),
> and I can't crash the box at all with proto A, I think you know the
> core cause.

Another thing to try is disabling TSO on the NIC with
"ethtool -K ethN tx off" -- if you're hitting another variant of the bug
we've seen that Lars alluded to, then disabling TSO will also disable
zerocopy (DRBD will still try, but the kernel will quietly convert to
non-zc), which avoids the bad Xen/DRBD/TCP interaction.

The "nice" thing about this fix is that you can still use the "good"
DRBD protocol. We have this command embedded in /etc/rc.local so it's
disabled on every boot, and it's been working well...

BTW - the stack traces are not ones I've seen before _but_ crapping out
calculating the checksum is a symptom.

Simon
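
For anyone wanting to try the rc.local approach described above, here is a rough sketch of how it might look. The function name disable_tx_offload, the ETHTOOL override variable, and the interface name eth1 are all made up for illustration; only the "ethtool -K <iface> tx off" command itself comes from the message. Substitute whichever interface actually carries your DRBD replication traffic.

```shell
#!/bin/sh
# Hypothetical helper for /etc/rc.local (names are illustrative only).
# Runs the command quoted in the message: ethtool -K <iface> tx off

disable_tx_offload() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: disable_tx_offload <interface>" >&2
        return 1
    fi
    # ETHTOOL is an assumed override so the command can be dry-run,
    # e.g. ETHTOOL=echo prints the arguments instead of executing them.
    "${ETHTOOL:-ethtool}" -K "$iface" tx off
}

# Example call from rc.local (eth1 is an assumption, not from the thread):
#   disable_tx_offload eth1
```

Afterwards, "ethtool -k eth1" (lowercase k) lists the current offload settings, which should make it possible to check that the change took effect.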