[DRBD-user] drbd blocking IO
gen at pautinanet.com
gen at pautinanet.com
Mon Apr 29 14:09:45 CEST 2019
Hi everybody!
I have been used DRBD for many years. Thanks to developers for such a great tool.
But now I have a problem: drbd device is blocking 100% IO for a very long period.
My setup is:
2 nodes (actually 3, but for problem disk it's not used), all is in one room, cable connected, Linstor-managed.
The problem VM is 1 DRBD resource with 2 volumes (I have the same issue with another resouce but with much lesser impact).
VM is a MSSQL server + web - constantly and stable loaded, but not too much. Earlier I have no problem using drbd84 at the same HW for the same VM.
I already had lvm for my VMs, so for DRBD pool I've created a LV in the same VG - vg_system/lv_drbdpool. So LV for DRBD resource is LV over LV:
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
vm-apb-oper_00 drbdpool -wi-ao---- 50.05g
vm-apb-oper_01 drbdpool -wi-ao---- 200.19g
lv_drbdpool vg_system -wi-ao---- 700.00g
When I try to sync, VM's IO is hanging up (sometimes for minutes!), although underlying storage device is mostly idle.
11:17:48 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:17:49 AM dev8-0 18.00 0.00 296.00 16.44 0.00 0.11 0.11 0.20
11:17:49 AM dev147-100 9.00 0.00 40.00 4.44 0.00 0.00 0.00 0.00
11:17:49 AM dev147-101 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-105 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-102 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-1005 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-114 0.00 0.00 0.00 0.00 9.00 0.00 0.00 100.00
11:17:49 AM dev147-112 8.00 0.00 32.00 4.00 0.00 0.25 0.25 0.20
11:17:49 AM dev147-115 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00
11:17:49 AM dev147-113 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-109 1.00 0.00 80.00 80.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-110 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM dev147-111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:49 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
11:17:50 AM dev8-0 12.00 0.00 144.00 12.00 0.00 0.08 0.08 0.10
11:17:50 AM dev147-100 16.00 0.00 80.00 5.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-101 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-105 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-102 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-1005 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-114 0.00 0.00 0.00 0.00 9.00 0.00 0.00 100.00
11:17:50 AM dev147-112 8.00 0.00 47.00 5.88 0.00 0.12 0.12 0.10
11:17:50 AM dev147-115 0.00 0.00 0.00 0.00 128.00 0.00 0.00 100.00
11:17:50 AM dev147-113 5.00 0.00 17.00 3.40 0.00 0.00 0.00 0.00
11:17:50 AM dev147-109 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-110 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
11:17:50 AM dev147-111 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
IO at the other DRBD peer is idle too:
Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
Average: dev8-0 32.22 0.00 899.56 27.92 0.01 0.27 0.26 0.83
Average: dev8-32 33.00 0.00 987.56 29.93 0.01 0.40 0.39 1.28
Average: dev8-16 34.00 0.00 1137.78 33.46 0.01 0.28 0.26 0.90
Average: dev8-48 82.67 3084.44 1637.33 57.12 0.32 3.83 0.35 2.91
When I try to disconnect resource, the operation time-outs:
# drbdadm disconnect vm-apb-oper
Command 'drbdsetup disconnect vm-apb-oper 1' did not terminate within 5 seconds
Log shows:
Apr 15 11:28:16 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967216
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 365981593.
Apr 15 11:28:21 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 1329964647 because of state change 161760079
Apr 15 11:28:22 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967215
Apr 15 11:28:28 hyper1 kernel: drbd vm-apb-oper hyper2: [drbd_s_vm-apb-o/7524] sending time expired, ko = 4294967214
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Ignoring P_TWOPC_ABORT packet 1329964647.
Apr 15 11:28:32 hyper1 kernel: drbd vm-apb-oper hyper2: Rejecting concurrent remote state change 2449354269 because of state change 161760079
and also something like this:
[7382791.698838] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 63
[7382797.699830] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 62
[7382803.700840] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 61
[7382809.701842] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 60
[7382815.702857] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 59
[7382821.703854] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 58
[7382827.704846] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 57
[7382833.706818] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 56
[7382839.707920] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 55
[7382845.708829] drbd vm-apb-reports hyper2: [drbd_s_vm-apb-r/7533] sending time expired, ko = 54
# modinfo drbd
filename: /lib/modules/3.10.0-957.5.1.el7.x86_64/weak-updates/drbd90/drbd.ko
alias: block-major-147-*
license: GPL
version: 9.0.16-1
description: drbd - Distributed Replicated Block Device v9.0.16-1
author: Philipp Reisner <phil at linbit.com>, Lars Ellenberg <lars at linbit.com>
retpoline: Y
rhelversion: 7.6
setup for resource:
linstor resource-definition drbd-options --max-buffers 8000 --max-epoch-size 8000 --sndbuf-size 0 --congestion-fill 1048576 --congestion-extents 16000 --c-fill-target 1048576 --c-max-rate 16384 --ko-count 200 --read-balancing least-pending --verify-alg sha1 --unset-disk-barrier --unset-disk-flushes --unset-md-flushes --unset-disk-drain --allow-two-primaries no vm-apb-oper
I've tried different congestion-* and other params for resource but it changes nothing.
If I disconnect the resource, VM functions flawlessly.
How can I fix the problem?
More information about the drbd-user
mailing list