Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

While somewhat similar to the issues I raised 3 years ago in the 'recovery from "page allocation failure"' thread, this seems to be an entirely different beast under the hood.

Some facts first:
2 node Debian Wheezy cluster with a custom 4.1.6 kernel (the newest at build time) and DRBD 8.4.5. 128GB RAM, all "disks" are Intel DC SSD models.
The use case is a dovecot mailbox cluster with up to 65k IMAP sessions and thus processes per node; RAM usage is about 85GB for processes, 20GB for SLAB (dir_entry, inodes) and the rest page cache.

At times DRBD tends to have hiccups like this:
--- node a ---
Apr 11 01:40:55 mbx09 kernel: [50024550.377668] drbd mb09: sock was shut down by peer
Apr 11 01:40:55 mbx09 kernel: [50024550.377682] drbd mb09: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 11 01:40:55 mbx09 kernel: [50024550.377685] drbd mb09: short read (expected size 16)
Apr 11 01:40:55 mbx09 kernel: [50024550.377692] drbd mb09: asender terminated
Apr 11 01:40:55 mbx09 kernel: [50024550.377694] drbd mb09: Terminating drbd_a_mb09
--- node b ---
Apr 11 01:40:55 mbx10 kernel: [50031643.361001] drbd mb09: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Apr 11 01:40:55 mbx10 kernel: [50031643.361010] drbd mb09: connection_finish_peer_reqs() failed
Apr 11 01:40:55 mbx10 kernel: [50031643.361605] drbd mb09: asender terminated
Apr 11 01:40:55 mbx10 kernel: [50031643.361609] drbd mb09: Terminating drbd_a_mb09
Apr 11 01:40:55 mbx10 kernel: [50031643.417112] drbd mb09: Connection closed
---

This may continue with broken pipes, timeouts and a general inability to regain a consistent and up-to-date state, though sometimes (hours later!) it may untwist itself without any manual help.

Whenever this happened, the systems were always extremely lightly loaded (no suspicious CPU usage), the block storage (MD RAID and underlying SSDs) was bored and the network/replication link in perfect health. Nothing relevant in the logs either, unlike the "allocation failures" seen years ago (and not since, with recent kernels/DRBD). In short, nothing obvious that would explain these timeouts.

However, due to my past encounters, I already had vm/min_free_kbytes set to 1GB and, by force of habit as well as past experience, dropped the pagecache and issued a compact_memory, which indeed resolved the DRBD issues (a sketch of these steps is at the end of this mail).

Last night I dug a bit deeper and ran "iostat -x". Lo and behold, while any and all actual devices (MD, SSDs) were as bored as usual, the affected DRBD device was fluctuating between 20-100% utilization (a small monitoring sketch for this is also at the end). Again, this is NOT the network link, as there are 2 DRBD resources (one primary on each node) and the other one was not affected at all.

Dropping the pagecache (even without compacting memory) immediately resolved things and the DRBD utilization according to iostat went back to the normal 1-2%.

So we're obviously looking at a memory fragmentation/allocation issue, but since no component actually logged a protest anywhere I can't be certain whether this is within DRBD (having only 1 out of 2 resources affected hints at this), the ever suspicious MM stack or the network drivers (very unlikely IMHO).

If this sounds familiar, I'd appreciate any feedback.

FWIW, the users/data are being migrated to a new cluster with twice the RAM, Jessie, a 4.9 kernel and DRBD 8.4.7; the above cluster will be upgraded to that level afterwards.
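For completeness, here's roughly what the workaround amounts to: a minimal sketch of the /proc knobs mentioned above (Python only to keep it self-contained, the usual echo commands do the same; root required, and compact_memory assumes a kernel built with CONFIG_COMPACTION):

--- workaround sketch ---
#!/usr/bin/env python3
# Minimal sketch of the workaround described above: keep a large free
# memory reserve, flush dirty data, drop the page cache and ask the
# kernel to compact memory.
import os

def write_proc(path, value):
    # The VM tunables are plain text files under /proc/sys/vm.
    with open(path, "w") as f:
        f.write(value)

# 1GB reserve so atomic/higher-order allocations don't starve.
write_proc("/proc/sys/vm/min_free_kbytes", str(1024 * 1024))

# Flush dirty pages before dropping anything.
os.sync()

# 1 = drop the page cache only (3 would also drop dentries/inodes).
write_proc("/proc/sys/vm/drop_caches", "1")

# Explicitly compact memory to rebuild higher-order free pages.
write_proc("/proc/sys/vm/compact_memory", "1")
---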
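And since staring at iostat during these episodes gets tedious, a rough sketch of how the same %util figure can be derived from /proc/diskstats (the io_ticks field, which is what iostat -x uses); "drbd0" is only an assumption for the minor backing the mb09 resource:

--- util monitoring sketch ---
#!/usr/bin/env python3
# Rough sketch: compute %util for a block device from /proc/diskstats,
# the way iostat -x does (delta of io_ticks over wall clock time).
# "drbd0" is an assumption for the minor backing the mb09 resource.
import time

def io_ticks(dev):
    # Field 10 of the per-device stats (13th column overall) is the
    # number of milliseconds spent doing I/O.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12])
    raise ValueError("device %s not found" % dev)

dev, interval = "drbd0", 5.0
prev = io_ticks(dev)
while True:
    time.sleep(interval)
    cur = io_ticks(dev)
    print("%s util: %5.1f%%" % (dev, 100.0 * (cur - prev) / (interval * 1000.0)))
    prev = cur
---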
Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/