[DRBD-user] linux-4.11/drbd-8.4: drbd device stuck at "100% utilization" with no reads/write going to underlying device
lvml at 5t9.de
Wed May 10 21:39:09 CEST 2017
I recently tried to update a pair of servers that use DRBD 8.4.x (as included
in the mainline Linux kernel from kernel.org) from linux-4.9 to linux-4.11.
The "secondary" server had been running linux-4.11 for some time without any
issues. Both servers host three drbd devices: two contain (separate) btrfs
filesystems and one contains an XFS filesystem. Between each drbd device and
its filesystem there is one dm-crypt block device layer.
This setup has been used for years (with kernel updates from time to time).
After updating the primary server to linux-4.11, I experienced the following
issue that forced me to revert it to linux-4.9:
Soon after user processes put a significant read load plus a little write
load on one of the btrfs filesystems, the "utilization %" displayed by
"iostat -dx 3", which is usually about the same for the drbd device and the
underlying physical disk, diverges between the two devices: the drbd device
is pinned at "100% utilization", while the physical device goes idle.
"Dirty" and "Writeback" pages, as displayed by "cat /proc/meminfo", are no
longer written to the physical device.
There are no errors: no I/O errors, no strange messages, just the fact that
more and more "dirty data" accumulates, leading to more and more processes
sleeping in "D" state. Even if I kill all processes that do I/O to the btrfs
filesystem, the same amount of "dirty data" sits there unflushed forever
(while on other devices, normal writing still occurs).
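For anyone who wants to watch the same numbers without iostat, here is a minimal sketch of the comparison described above. It assumes the usual /proc/diskstats layout, where the in-flight count and the "time spent doing I/O" counter (which iostat's %util is derived from) are the 9th and 10th statistics after the device name. The device names "drbd0" and "sda" in the usage note are placeholders, not the names from my setup.

```python
def parse_diskstats(text):
    """Map device name -> (ios_in_flight, io_ticks_ms) from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # not a full statistics line
        # fields[0:3] are major, minor, name; the 11 classic counters follow
        stats[fields[2]] = (int(fields[11]), int(fields[12]))
    return stats


def utilization_pct(ticks_before_ms, ticks_after_ms, interval_s):
    """Percentage of the interval during which the device had I/O in progress."""
    return 100.0 * (ticks_after_ms - ticks_before_ms) / (interval_s * 1000.0)


def parse_meminfo(text, keys=("Dirty", "Writeback")):
    """Return the selected /proc/meminfo counters in kB."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys:
            values[name] = int(rest.split()[0])
    return values


def read_file(path):
    with open(path) as fh:
        return fh.read()
```

Usage: call parse_diskstats(read_file("/proc/diskstats")) twice, a few seconds apart, and feed the two io_ticks values for "drbd0" and for "sda" into utilization_pct(). In the stuck state I would expect the drbd device to show about 100% with a non-zero in-flight count, the backing disk to show about 0%, and parse_meminfo(read_file("/proc/meminfo")) to return the same Dirty/Writeback values on every sample.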
At this point, no "sync", "umount" or such will finish, and of course a
soft reboot also hangs.
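To see which processes are stuck, the "D" state can be read from /proc/<pid>/stat. The following is only a sketch of that check; the function names are made up for illustration.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is enclosed in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc="/proc"):
    """List (pid, comm) of all processes in uninterruptible sleep."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError):
            continue  # process exited while we were looking
        if state == "D":
            found.append((pid, comm))
    return found
```

Calling d_state_processes() repeatedly while the device is stuck should show the list growing as more writers block.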
The symptom is not specific to any one of the filesystems: if some I/O load
is applied, any of them (also several at the same time) can get into this
"100% utilization stuck" state.
This symptom occurs even if there is no secondary DRBD server to connect to,
so it is probably unrelated to any network activities of DRBD.
The kernel .config has CONFIG_BLK_WBT=y, but I tested with WBT turned both on
and off at runtime, and the symptom occurs under both conditions.
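For reference, one way to toggle WBT at runtime is the queue/wbt_lat_usec sysfs attribute, where writing 0 disables throttling and -1 restores the default. The sketch below assumes that mechanism; the attribute may be absent on devices without a request queue that supports WBT, and the device name in the usage note is a placeholder.

```python
import os


def wbt_path(device, sysfs="/sys"):
    """Path of the writeback-throttling latency target for a block device."""
    return os.path.join(sysfs, "block", device, "queue", "wbt_lat_usec")


def read_wbt(device, sysfs="/sys"):
    """Return wbt_lat_usec as an int, or None if the attribute is absent."""
    try:
        with open(wbt_path(device, sysfs)) as fh:
            return int(fh.read().strip())
    except FileNotFoundError:
        return None


def set_wbt(device, usec, sysfs="/sys"):
    """Write a new latency target (needs root); 0 turns throttling off."""
    with open(wbt_path(device, sysfs), "w") as fh:
        fh.write(str(usec))
```

Usage: read_wbt("sda") shows the current setting for the backing disk and set_wbt("sda", 0) disables WBT on it.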
Any ideas what might go wrong here?