[DRBD-user] frequent wrong magic value with kernel >4.9

Thu Feb 1 13:34:54 CET 2018

On Tue, Jan 23, 2018 at 07:14:13PM +0100, Andreas Pflug wrote:
> On 15.01.18 at 16:37, Andreas Pflug wrote:
> > On 09.01.18 at 16:24, Lars Ellenberg wrote:
> >> On Tue, Jan 09, 2018 at 03:36:34PM +0100, Lars Ellenberg wrote:
> >>> On Mon, Dec 25, 2017 at 03:19:42PM +0100, Andreas Pflug wrote:
> >>>> Running two Debian 9.3 machines, directly connected via 10GBit on-board
> >>>> X540 10GBit, with 15 drbd devices.
> >>>>
> >>>> When running a 4.14.2 kernel (from sid) or a 4.13.13 kernel (from
> >>>> stretch-backports), I see several "Wrong magic value 0x4c414245 in
> >>>> protocol version 101" per day issued by the secondary, with subsequent
> >>>> termination of the connection, reconnect and resync. The magic value
> >>>> logged differs, quite often 0x00.
> >>>>
> >>>> Using the current 4.9.65 kernel (or older) from stretch didn't show
> >>>> these aborts in the past, and after going back they're gone again. It
> >>>> seems to be some problem introduced after 4.9 kernels, since both 4.9
> >>>> and 4.13 include drbd 8.4.7. Maybe some interference with the nic driver?
> >>>>
> >>>> Kernel    drbd   ixgbe     errors
> >>>> 4.9.65   8.4.7  4.4.0-k    no
> >>>> 4.13.13  8.4.7  5.1.0-k    yes
> >>>> 4.14.2   8.4.10 5.1.0-k    yes
> >>> "strange".
> >>>
> >>> What does "lsblk -D" and "lsblk -t" say?
> >>>
> >>> Do you have a scratch volume you can play with?
> >>> As a datapoint, can you try "blkdiscard /dev/drbdX" on it?
> >>> dd if=/dev/zero of=/dev/drbdX bs=1G oflag=direct count=1?
> >>>
> >>> Something like that?
> >>> Any "easy" reproducer?
> >> Maybe while preparing the pull requests for upstream,
> >> we missed/mangled/broke something.
> >>
> >> Can you also reproduce with "out-of-tree" drbd 8.4.10?
> >>
> > So I currently have kernel 4.9.65 with drbd 8.4.7 on the primary server,
> > with the second server (4.14.7 with drbd 8.4.11-rc1) having all drbd
> > devices secondary.
> >
> > Logged in kern.log on the secondary:
> > Jan 15 15:13:22 xen2 kernel: [451977.741177] drbd monitor.opt: Wrong
> > magic value 0x64656772 in protocol version 101
> 
> Any news on this issue, anything to test?
> Still getting that message 20 times a day, system not really busy.

Nothing I can make any sense of, yet.
And as of now, afaics, you are "the only one" reporting this.
Can be a lot of things.
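
One thing that can be checked without further data is what the logged values look like when read byte by byte. The following is only a minimal sketch: the helper name magic_to_ascii is made up for illustration, and it assumes the logged number is the 32-bit magic field interpreted in big-endian (network) byte order. It renders each value as four characters, replacing non-printable bytes with a dot.

```python
def magic_to_ascii(value: int) -> str:
    """Render a 32-bit value as four characters, '.' for non-printable bytes.

    Assumption: the value is read in big-endian (network) byte order.
    """
    raw = value.to_bytes(4, byteorder="big")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


# The two values quoted in this thread, plus the 0x00 case mentioned
# in the original report.
for magic in (0x4C414245, 0x64656772, 0x00000000):
    print(f"0x{magic:08x} -> {magic_to_ascii(magic)!r}")
```

Both non-zero values from this thread decode to printable ASCII ("LABE" and "degr"). That might hint that the receiver is looking at payload data where it expects a packet header, but that is a guess, not a diagnosis.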

Maybe you can set up a tcpdump capture in ringbuffer mode,
wait for this to happen (watching the kernel log),
and somehow make the pcap containing the event available to me?

something like this (please double check the man page yourself):
tcpdump  -s 0 -i $NIC -w drbd.pcap. -W 100 -C 1 [possible port filter here]
(keep in mind that pcap will contain raw block device data,
which you may not want to show to "the internet").
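
To know which capture file to keep, it helps to pull the timestamp of each event out of the kernel log. The following is a rough sketch, assuming the log lines have the same shape as the kern.log line quoted earlier in this thread. The function name find_events and the regular expression are illustrative and may need adjusting for other syslog formats.

```python
import re

# Matches lines shaped like the kern.log entry quoted in this thread:
#   "Jan 15 15:13:22 xen2 kernel: [...] drbd <resource>: Wrong magic value 0x..."
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:.*?"
    r"drbd (?P<resource>\S+): Wrong magic value (?P<magic>0x[0-9a-fA-F]+)"
)


def find_events(lines):
    """Return (timestamp, resource, magic) for every 'Wrong magic value' line."""
    events = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            events.append((m["stamp"], m["resource"], int(m["magic"], 16)))
    return events


sample = [
    # Log line from this thread, joined back into a single line.
    "Jan 15 15:13:22 xen2 kernel: [451977.741177] drbd monitor.opt: "
    "Wrong magic value 0x64656772 in protocol version 101",
    # Made-up filler line, only here to show that non-matching lines are skipped.
    "Jan 15 15:13:23 xen2 kernel: [451978.000000] some unrelated line",
]

for stamp, resource, magic in find_events(sample):
    print(stamp, resource, hex(magic))
```

In practice the lines would come from reading kern.log on the secondary, and the timestamps can then be compared against the modification times of the ringbuffer files written by tcpdump.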

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed