[DRBD-user] kernel crash when secondary disappears. centos 5.3 kernel-xen issue?

Fri Apr 17 02:34:40 CEST 2009

> > you could try and tell drbd to no longer use zero copy send using
> > sendpage, but always do an actual data copy to the socket buffer,
> > which should avoid the described problem.  easiest way to do so: use
> > DRBD protocol A, and see if these crashes still occur.
> 
> Given that I seem to have a reproducible test case (see below), that
> should be easy enough to try.
> 
> YUP, tried it. I have NOT exhaustively tested it, but given that
> typing sync without any significant I/O preceding it (and I'm in
> the guest in single user mode) is enough to crash dom0 (proto C),
> and I can't crash the box at all with proto A, I think you know the
> core cause.

Another thing to try is disabling TSO on the NIC with "ethtool -K ethN
tx off" (turning off TX checksum offload this way takes TSO down with
it). If you're hitting another variant of the bug we've seen that Lars
alluded to, then disabling TSO will also disable zero-copy: DRBD will
still try, but the kernel will quietly fall back to a non-zero-copy
send, which avoids the bad Xen/DRBD/TCP interaction.

The "nice" thing about this fix is that you can still use the "good"
DRBD protocol (C).

We have this command embedded in /etc/rc.local so TSO is disabled on
every boot, and it's been working well...
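For anyone wanting to copy the setup, here is a minimal sketch of what such an rc.local excerpt might look like. The interface name eth0 is an assumption (use whichever NIC carries the DRBD replication link), and /sbin/ethtool is the usual path on CentOS 5:

```shell
#!/bin/sh
# /etc/rc.local (excerpt): runs once at the end of boot.
# ASSUMPTION: eth0 is the NIC carrying DRBD replication traffic.
# "tx off" disables TX checksum offload; the kernel then also drops
# TSO on that NIC, so TCP copies the data instead of sending zero-copy.
/sbin/ethtool -K eth0 tx off
```

After a reboot, "ethtool -k eth0" (lowercase -k) lists the current offload settings, and both TX checksumming and TCP segmentation offload should show as off.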

BTW, the stack traces are not ones I've seen before, _but_ crapping out
while calculating the checksum is a symptom of it.

Simon