[DRBD-user] frequent wrong magic value with kernel >4.9 caused by big mtu

Mon Feb 12 17:17:24 CET 2018

On 08.02.18 at 11:30, Andreas Pflug wrote:
> On 01.02.18 at 13:34, Lars Ellenberg wrote:
>> On Tue, Jan 23, 2018 at 07:14:13PM +0100, Andreas Pflug wrote:
>>> On 15.01.18 at 16:37, Andreas Pflug wrote:
>>>> On 09.01.18 at 16:24, Lars Ellenberg wrote:
>>>>> On Tue, Jan 09, 2018 at 03:36:34PM +0100, Lars Ellenberg wrote:
>>>>>> On Mon, Dec 25, 2017 at 03:19:42PM +0100, Andreas Pflug wrote:
>>>>>>> Running two Debian 9.3 machines, directly connected via on-board
>>>>>>> X540 10GBit NICs, with 15 drbd devices.
>>>>>>>
>>>>>>> When running a 4.14.2 kernel (from sid) or a 4.13.13 kernel (from
>>>>>>> stretch-backports), I see several "Wrong magic value 0x4c414245 in
>>>>>>> protocol version 101" per day issued by the secondary, with subsequent
>>>>>>> termination of the connection, reconnect and resync. The magic value
>>>>>>> logged differs, quite often 0x00.
>>>>>>>
>>>>>>> Using the current 4.9.65 kernel (or older) from stretch didn't show
>>>>>>> these aborts in the past, and after going back they're gone again. It
>>>>>>> seems to be some problem introduced after 4.9 kernels, since both 4.9
>>>>>>> and 4.13 include drbd 8.4.7. Maybe some interference with the nic driver?
>>>>>>>
>>>>>>> Kernel    drbd   ixgbe     errors
>>>>>>> 4.9.65   8.4.7  4.4.0-k    no
>>>>>>> 4.13.13  8.4.7  5.1.0-k    yes
>>>>>>> 4.14.2   8.4.10 5.1.0-k    yes
>>>>>> "strange".
>>>>>>
>>>>>> What do "lsblk -D" and "lsblk -t" say?
>>>>>>
>>>>>> Do you have a scratch volume you can play with?
>>>>>> As a datapoint, can you try to "blkdiscard /dev/drbdX" it?
>>>>>> dd if=/dev/zero of=/dev/drbdX bs=1G oflag=direct count=1?
>>>>>>
>>>>>> Something like that?
>>>>>> Any "easy" reproducer?
>>>>> Maybe while preparing the pull requests for upstream,
>>>>> we missed/mangled/broke something.
>>>>>
>>>>> Can you also reproduce with "out-of-tree" drbd 8.4.10?
>>>>>
>>>> So I currently have kernel 4.9.65 with drbd 8.3.7 on the primary server,
>>>> with the second server (4.14.7 with drbd 8.3.11-rc1) having all drbd
>>>> devices secondary.
>>>>
>>>> Logged in kern.log on the secondary:
>>>> Jan 15 15:13:22 xen2 kernel: [451977.741177] drbd monitor.opt: Wrong
>>>> magic value 0x64656772 in protocol version 101
>>> Any news on this issue, anything to test?
>>> Still getting that message 20 times a day, system not really busy.
>> Nothing I can make any sense of, yet.
>> And as of now, afaics, you are "the only one" reporting this.
>> Can be a lot of things.
>>
>> Maybe you can setup a tcpdump capture in ringbuffer mode,
>> wait for this to happen (watching the kernel log),
>> and make the pcap containing the event available to me somehow?
>>
>> something like this (please double check the man page yourself):
>> tcpdump  -s 0 -i $NIC -w drbd.pcap. -W 100 -C 1 [possible port filter here]
>> (keep in mind that pcap will contain raw block device data,
>> which you may not want to show to "the internet").
>>
> After the tcpdump analysis showed that the problem must be located below
> DRBD, I played around with the eth settings. Cutting down the former MTU
> of 9710 to the default of 1500 did fix the problem, as did disabling
> scatter-gather. So apparently big MTU and scatter-gather don't play
> nicely on later kernels (or with the updated nic driver).
>
> I filed a kernel bug on this:
> https://bugzilla.kernel.org/show_bug.cgi?id=198723

Unfortunately, scatter-gather seems NOT to be the culprit. Changing all
other receive and generic offload settings didn't help either, so only
the big MTU remains.

Regards,
Andreas
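
A side note on the logged values: the two "wrong magic" values quoted in
this thread happen to be printable ASCII when read byte by byte. The
sketch below is only an illustration and assumes the logged number is
shown in network (big-endian) byte order; the helper name magic_to_ascii
is made up for this example and is not part of DRBD. One possible
reading, not confirmed anywhere in the thread, is that the receiver was
looking at payload bytes where it expected a packet header.

```python
def magic_to_ascii(magic: int) -> str:
    """Render a 32-bit value as four characters, most significant byte first.

    Bytes outside the printable ASCII range are shown as '.'.
    Assumption: the logged value is in network (big-endian) order.
    """
    raw = magic.to_bytes(4, "big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


if __name__ == "__main__":
    # The two values reported in the thread, plus the all-zero case
    # that was described as occurring "quite often".
    for value in (0x4C414245, 0x64656772, 0x00000000):
        print(f"0x{value:08x} -> {magic_to_ascii(value)!r}")
```

Values that decode to readable text point towards data being read at the
wrong stream offset rather than random bit errors, but that is a hint to
check against a packet capture, not a diagnosis.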