<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
On 09.01.18 at 16:24, Lars Ellenberg wrote:<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">> On Tue, Jan 09, 2018 at 03:36:34PM +0100, Lars Ellenberg wrote:
>> On Mon, Dec 25, 2017 at 03:19:42PM +0100, Andreas Pflug wrote:
>>> Running two Debian 9.3 machines with 15 drbd devices, directly
>>> connected via on-board X540 10GBit NICs.
>>>
>>> When running a 4.14.2 kernel (from sid) or a 4.13.13 kernel
>>> (from stretch-backports), I see several "Wrong magic value
>>> 0x4c414245 in protocol version 101" errors per day, issued by the
>>> secondary, with subsequent termination of the connection,
>>> reconnect and resync. The logged magic value varies; quite often
>>> it is 0x00.
>>>
>>> The current 4.9.65 kernel (or older) from stretch didn't show
>>> these aborts in the past, and after going back to it they're gone
>>> again. It seems to be a problem introduced after the 4.9 kernels,
>>> since both 4.9 and 4.13 include drbd 8.4.7. Maybe some
>>> interference with the NIC driver?
>>>
>>> Kernel    drbd    ixgbe    errors
>>> 4.9.65    8.4.7   4.4.0-k  no
>>> 4.13.13   8.4.7   5.1.0-k  yes
>>> 4.14.2    8.4.10  5.1.0-k  yes
>>
>> "strange".
>>
>> What does "lsblk -D" and "lsblk -t" say?
</span>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
NAME                 ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED RQ-SIZE  RA WSAME
sda                          0 262144 262144     512     512    1 cfq       128 128    0B
└─sda1                       0 262144 262144     512     512    1 cfq       128 128    0B
  ├─local-stresstest         0 262144 262144     512     512    1           128 128    0B
  │ └─drbd16                 0 262144 262144     512     512    1           128 128    0B

NAME                 DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
sda                         0        0B       0B         0
└─sda1                      0        0B       0B         0
  ├─local-stresstest        0        0B       0B         0
  │ └─drbd16                0        0B       0B         0
</span>
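<br>
As an aside on the errors quoted above: taken as four big-endian bytes,
the magic 0x4c414245 is the ASCII string "LABE" (which happens to be the
start of the LVM label signature "LABELONE"), so it looks like stray data
rather than any real DRBD magic. A trivial, purely illustrative check:<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
# Decode the bogus "magic" as raw bytes (Python).
magic = 0x4C414245
print(magic.to_bytes(4, "big"))   # b'LABE'
</span>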
<span style="white-space: pre-wrap; display: block; width: 98vw;">>>
>>
>> Do you have a scratch volume you can play with? As a datapoint, can
>> you try to "blkdiscard /dev/drbdX" it?
</span><br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
blkdiscard: /dev/drbd16: BLKDISCARD ioctl failed: Operation not supported
</span>
It's hosted on LVM on a hardware RAID6 array.<br>
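<br>
That failure is consistent with the lsblk output above, where DISC-GRAN
and DISC-MAX are 0B all the way down the stack. For completeness, a quick
way to dump what each layer advertises via sysfs; just a sketch, and the
device names below are examples that would need adjusting to the actual
stack:<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
# Sketch: print the discard limits each layer advertises via sysfs.
# The device names are examples only; adjust them to the real stack.
from pathlib import Path

for dev in ("sda", "dm-0", "drbd16"):
    q = Path(f"/sys/block/{dev}/queue")
    gran = (q / "discard_granularity").read_text().strip()
    mx = (q / "discard_max_bytes").read_text().strip()
    print(f"{dev}: discard_granularity={gran}, discard_max_bytes={mx}")
</span>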
<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">>>
>> dd if=/dev/zero of=/dev/drbdX bs=1G oflag=direct count=1?
</span>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
dd if=/dev/zero of=/dev/drbd16 bs=1M count=3072 oflag=direct
</span>
Running this several times gives ~300 MB/s and no problems.<br>
<br>
This was executed with the primary on kernel 4.9.65 and the secondary on
4.14.7 (stretch-backports). It seems that zeroes don't trigger the
problem.<br>
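<br>
If it's the payload that matters, a possible next step would be to repeat
the same test with random data instead of zeroes. A sketch of that, under
the assumption that /dev/drbd16 is still expendable (O_DIRECT needs an
aligned buffer, hence the mmap):<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">
# Sketch: O_DIRECT writes of random (non-zero) data, same geometry as the
# dd run above. Assumes /dev/drbd16 is still an expendable scratch device!
import mmap
import os

DEV = "/dev/drbd16"
BS = 1024 * 1024      # 1 MiB blocks, as in the dd test
COUNT = 3072          # 3 GiB total

fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BS)               # page-aligned, as O_DIRECT requires
try:
    for _ in range(COUNT):
        buf.seek(0)
        buf.write(os.urandom(BS))     # random payload instead of zeroes
        os.write(fd, buf)
finally:
    buf.close()
    os.close(fd)
</span>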
<br>
<span style="white-space: pre-wrap; display: block; width: 98vw;">>
> Maybe while preparing the pull requests for upstream, we
> missed/mangled/broke something.
>
> Can you also reproduce with "out-of-tree" drbd 8.4.10?
</span><br>
Since my post to drbd-user didn't make it to the list for two weeks, I
missed the week after Christmas when everybody was on holiday. The
system is now back in full production, so I'm uncomfortable doing too
much testing.<br>
<br>
Regards,<br>
Andreas<br>
</body>
</html>