[DRBD-user] [SOLVED] primary crashing under high load with Intel 82574L and big mtu

Mon Jul 7 16:58:27 CEST 2014

On 04.07.14 14:27, Lars Ellenberg wrote:
> On Mon, Jun 30, 2014 at 04:25:01PM +0200, Andreas Pflug wrote:
>> I'm running a pair of Debian wheezy machines A and B with Xen 4.1 and a
>> wheezy-backport kernel 3.14.7 to have drbd 8.4.3.
>> Both machines are interconnected with a dedicated 1 Gbit/s Intel 82574L
>> (e1000e) drbd link, mtu=9216; storage on each is a fast battery-backed
>> caching RAID controller.
>>
>> There are 3 drbd devices active:
>> drbd7 UpToDate/UpToDate and being filled by a vm using rsync (steady
>> data rate of about 5.7MB/s)
>> drbd12 UpToDate/Inconsistent being resynced after initial creation, about
>> 50MB/s
>> drbd22 UpToDate/UpToDate being filled with arbitrary data using scp,
>> some MB/s.
>>
>> Resync is configured c-fill-target 512k; c-min-rate 2M; c-max-rate 50M
>>
>> I've observed several crashes of the machine that has drbd7 and drbd22
>> primary, whether I use machine A or B for that. The kernel log of the
>> faulting machine shows nothing at all, just the fresh reboot, and the
>> machine that stays up doesn't have any hint in kern.log either. Everything
>> is fine until "PingAck not arrived" and all connections go down.
>>
>> How can I find out what's leading to the crash?
> Attach a serial console,
> log to it,
> capture the output.

Thanks for the hint.
Did that, and the kernel console spits out several weird messages from
the e1000e driver, ending in a kernel panic. Doing some iperf tests,
I see horrible dropped RX packet counts when the mtu is set to 9216,
while everything looks fine with mtu=1500. Luckily, there's also an
unused I350 port in the machine, which behaves flawlessly with big
packets, so I switched the drbd connection over to it.
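
For anyone who wants to check their own link the same way, here is a
minimal Python sketch that samples the receive counters the kernel
exposes under /sys/class/net/<iface>/statistics before and after a test
run and reports the difference. This is not the tooling used above, just
an illustration; the function names and the interface name "eth1" are
made up for the example, and which counter a given driver increments on
a drop may differ.

```python
from pathlib import Path

# Standard Linux sysfs location for per-interface statistics.
SYSFS_NET = Path("/sys/class/net")

# Receive-side counters that exist for every network interface.
RX_COUNTERS = ("rx_packets", "rx_dropped", "rx_errors", "rx_missed_errors")


def read_rx_counters(iface, base=SYSFS_NET):
    """Return the current RX counters of `iface` as a dict of ints."""
    stats_dir = Path(base) / iface / "statistics"
    return {name: int((stats_dir / name).read_text()) for name in RX_COUNTERS}


def counter_delta(before, after):
    """Return how much each counter grew between two samples."""
    return {name: after[name] - before[name] for name in before}


def report(iface):
    """Sample counters, wait for the operator to run a test, sample again."""
    before = read_rx_counters(iface)
    input(f"Run the iperf test over {iface} now, then press Enter... ")
    after = read_rx_counters(iface)
    for name, value in counter_delta(before, after).items():
        print(f"{name}: +{value}")
```

Calling report("eth1") once with the mtu at 1500 and once at 9216 (with
"eth1" replaced by the real interface) makes the two runs directly
comparable.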

Conclusion:
Don't use Intel 82574L with MTU != 1500.

Regards,
Andreas