[DRBD-user] drbd blocking IO

gen at pautinanet.com
Mon Apr 29 14:09:45 CEST 2019


Hi everybody!
I have been using DRBD for many years. Thanks to the developers for such a great tool.
But now I have a problem: the DRBD device blocks IO completely for very long periods.
My setup is:
2 nodes (actually 3, but the third is not used for the problem disk), all in one room, connected by cable, LINSTOR-managed.
The problem VM uses one DRBD resource with 2 volumes (I have the same issue with another resource, but with much smaller impact).
The VM is an MSSQL server plus a web server - constantly and steadily loaded, but not heavily. Earlier I had no problems running drbd84 on the same hardware for the same VM.
I already had LVM for my VMs, so for the DRBD pool I created an LV in the same VG - vg_system/lv_drbdpool. So the LV backing a DRBD resource is an LV on top of an LV:
  LV                VG        Attr       LSize   Pool Origin   Data%  Meta%  Move Log Cpy%Sync Convert
  vm-apb-oper_00    drbdpool  -wi-ao----  50.05g
  vm-apb-oper_01    drbdpool  -wi-ao---- 200.19g
  lv_drbdpool       vg_system -wi-ao---- 700.00g
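
For completeness, the LV-over-LV stacking can be inspected with standard LVM/block tools, for example (VG/LV names taken from the lvs output above; output omitted here):
# lsblk -s /dev/drbdpool/vm-apb-oper_00
# lvs -o lv_name,vg_name,devices drbdpool vg_system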

When I try to sync, the VM's IO hangs (sometimes for minutes!), although the underlying storage device is mostly idle:
11:17:48 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:17:49 AM    dev8-0     18.00      0.00    296.00     16.44      0.00      0.11      0.11      0.20
11:17:49 AM dev147-100      9.00      0.00     40.00      4.44      0.00      0.00      0.00      0.00
11:17:49 AM dev147-101      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-105      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-102      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-1005      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-114      0.00      0.00      0.00      0.00      9.00      0.00      0.00    100.00
11:17:49 AM dev147-112      8.00      0.00     32.00      4.00      0.00      0.25      0.25      0.20
11:17:49 AM dev147-115      0.00      0.00      0.00      0.00    128.00      0.00      0.00    100.00
11:17:49 AM dev147-113      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-109      1.00      0.00     80.00     80.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-110      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM dev147-111      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:49 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
11:17:50 AM    dev8-0     12.00      0.00    144.00     12.00      0.00      0.08      0.08      0.10
11:17:50 AM dev147-100     16.00      0.00     80.00      5.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-101      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-105      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-102      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-1005      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-114      0.00      0.00      0.00      0.00      9.00      0.00      0.00    100.00
11:17:50 AM dev147-112      8.00      0.00     47.00      5.88      0.00      0.12      0.12      0.10
11:17:50 AM dev147-115      0.00      0.00      0.00      0.00    128.00      0.00      0.00    100.00
11:17:50 AM dev147-113      5.00      0.00     17.00      3.40      0.00      0.00      0.00      0.00
11:17:50 AM dev147-109      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-110      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
11:17:50 AM dev147-111      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
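
The dev147-* rows are the DRBD minors (major 147, as in the modinfo output further down). To map a minor back to its resource I would check something like the following (the sh-dev helper is my assumption about this drbd-utils version):
# ls -l /dev/drbd*
# drbdadm sh-dev vm-apb-oper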

IO on the other DRBD peer is mostly idle too:
Average:          DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
Average:       dev8-0     32.22      0.00    899.56     27.92      0.01      0.27      0.26      0.83
Average:      dev8-32     33.00      0.00    987.56     29.93      0.01      0.40      0.39      1.28
Average:      dev8-16     34.00      0.00   1137.78     33.46      0.01      0.28      0.26      0.90
Average:      dev8-48     82.67   3084.44   1637.33     57.12      0.32      3.83      0.35      2.91
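
To see whether requests are stuck inside DRBD rather than on the backing disks, the per-device and per-peer counters should help; something along these lines (output omitted, flags as I understand drbd9's drbdsetup):
# drbdadm status vm-apb-oper
# drbdsetup status vm-apb-oper --verbose --statistics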

When I try to disconnect the resource, the operation times out:
# drbdadm disconnect vm-apb-oper
Command 'drbdsetup disconnect vm-apb-oper 1' did not terminate within 5 seconds
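
As a side note, I assume the connection could be torn down forcibly with the command below (peer node id 1 taken from the drbdsetup line above; I have not verified that this helps with the hang):
# drbdsetup disconnect vm-apb-oper 1 --force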

The log shows:
Apr 15 11:28:16 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967216
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 365981593.
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 1329964647 because of state change 161760079
Apr 15 11:28:22 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967215
Apr 15 11:28:28 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967214
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 1329964647.
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 2449354269 because of state change 161760079

and also something like this:
[7382791.698838] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 63
[7382797.699830] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 62
[7382803.700840] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 61
[7382809.701842] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 60
[7382815.702857] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 59
[7382821.703854] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 58
[7382827.704846] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 57
[7382833.706818] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 56
[7382839.707920] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 55
[7382845.708829] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 54


# modinfo drbd
filename:       /lib/modules/3.10.0-957.5.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias:          block-major-147-*
license:        GPL
version:        9.0.16-1
description:    drbd - Distributed Replicated Block Device v9.0.16-1
author:         Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
retpoline:      Y
rhelversion:    7.6
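
The matching userspace versions can be listed for reference (output not included; this is CentOS 7, hence rpm):
# drbdadm --version
# rpm -qa | grep -iE 'drbd|linstor'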

The setup for the resource:
linstor resource-definition drbd-options --max-buffers 8000 --max-epoch-size 8000 --sndbuf-size 0 --congestion-fill 1048576 --congestion-extents 16000 --c-fill-target 1048576 --c-max-rate 16384 --ko-count 200 --read-balancing least-pending --verify-alg sha1 --unset-disk-barrier --unset-disk-flushes --unset-md-flushes --unset-disk-drain --allow-two-primaries no vm-apb-oper
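
The options as they actually land in the generated DRBD configuration on the node can be checked with, e.g.:
# drbdadm dump vm-apb-oper
# drbdsetup show vm-apb-oper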

I've tried different congestion-* and other parameters for the resource, but it changes nothing.
If I disconnect the resource, the VM functions flawlessly.
How can I fix the problem?

