[DRBD-user] DRBD versus memory fragmentation

Christian Balzer chibi at gol.com
Wed May 10 05:01:18 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

any insights on the issue below?

As mentioned, I now have a 2nd cluster up and running with a 4.9 kernel and
DRBD 8.4.7.
After about 30 days of operation the issue reared its ugly head again:
---
[3526901.689492] block drbd0: Remote failed to finish a request within 60444ms > ko-count (10) * timeout (60 * 0.1s)
[3526901.689516] drbd mb11: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) 
---
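
For context, the 60 second budget in that message is simply ko-count times
timeout from the net section of the resource config; roughly this (only a
sketch of the relevant bits, the values shown are the ones printed in the log):
---
resource mb11 {
  net {
    timeout   60;   # unit is 0.1s, i.e. 6 seconds per request
    ko-count  10;   # give up after 10 consecutive timeouts, 60s in total
  }
}
---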

Thankfully this recovered by itself, but not w/o delaying things
noticeably from dovecot's perspective.
Again, this was at 07:43 local time, with very little activity on either node.
Before somebody suggests transparent hugepage defrag as the culprit: not in use here.
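
For the record, the knobs in question on a 4.9 kernel, and how to make sure
defrag stays off (just a sketch, paths as they exist on these nodes):
---
# cat /sys/kernel/mm/transparent_hugepage/enabled
# cat /sys/kernel/mm/transparent_hugepage/defrag
# echo never > /sys/kernel/mm/transparent_hugepage/defrag   # if it weren't off already
---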

The node which failed to respond in time once again had pretty badly
fragmented memory:
---
# cat /proc/buddyinfo 
Node 0, zone      DMA      0      1      0      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32  40598  28249  10071    601     24      0      0      0      0      0      0 
Node 0, zone   Normal 374754  67419    456     85      0      0      0      0      0      0      0 
Node 1, zone   Normal 1069436 584167 151282  13749   1570      1      0      0      0      0      0 
# echo 1 > /proc/sys/vm/compact_memory 
# cat /proc/buddyinfo 
Node 0, zone      DMA      0      1      0      0      2      1      1      0      1      1      3 
Node 0, zone    DMA32  11856   7148   3334   1998    895    644    382    184     26      0      0 
Node 0, zone   Normal 269437  27100   5996   4188   2852   1422    513    155     46     66     35 
Node 1, zone   Normal  52310  62435  22325   7498   5673   4410   3096   1978   1085    949   1152
---
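
For those who don't stare at buddyinfo daily: the columns are counts of free
blocks of order 0 through 10, i.e. 4KB up to 4MB of contiguous memory. Before
compaction the Normal zone of node 0 had nothing left at order 4 (64KB) or
above. Purely for illustration, a quick way to watch just the higher orders
per zone would be something like:
---
# awk '{ hi = 0; for (i = 9; i <= NF; i++) hi += $i;
         printf "%s %s zone %-6s free blocks of order >= 4: %d\n", $1, $2, $4, hi }' /proc/buddyinfo
---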

I simply can't believe or accept that manually dropping caches and
compacting memory is required to run a stable DRBD cluster in this day
and age.

Christian

On Tue, 11 Apr 2017 11:38:07 +0900 Christian Balzer wrote:

> Hello,
> 
> While somewhat similar to the issues I raised 3 years ago in the 
> 'recovery from "page allocation failure"' thread, this seems to be an
> entirely different beast under the hood.
> 
> Some facts first:
> 2 node Debian Wheezy cluster with a custom 4.1.6 kernel (the newest at
> build time) and DRBD 8.4.5.
> 128GB RAM, all "disks" are Intel DC SSD models.
> The use case is a dovecot mailbox cluster, with up to 65k IMAP sessions
> and thus processes per node, RAM usage is about 85GB for processes, 20GB
> for SLAB (dentries, inodes) and the rest page cache.
> 
> At some times DRBD tends to have hiccups like this:
> --- node a---
> Apr 11 01:40:55 mbx09 kernel: [50024550.377668] drbd mb09: sock was shut down by peer
> Apr 11 01:40:55 mbx09 kernel: [50024550.377682] drbd mb09: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown ) 
> Apr 11 01:40:55 mbx09 kernel: [50024550.377685] drbd mb09: short read (expected size 16)
> Apr 11 01:40:55 mbx09 kernel: [50024550.377692] drbd mb09: asender terminated
> Apr 11 01:40:55 mbx09 kernel: [50024550.377694] drbd mb09: Terminating drbd_a_mb09
> --- node b---
> Apr 11 01:40:55 mbx10 kernel: [50031643.361001] drbd mb09: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) 
> Apr 11 01:40:55 mbx10 kernel: [50031643.361010] drbd mb09: connection_finish_peer_reqs() failed
> Apr 11 01:40:55 mbx10 kernel: [50031643.361605] drbd mb09: asender terminated
> Apr 11 01:40:55 mbx10 kernel: [50031643.361609] drbd mb09: Terminating drbd_a_mb09
> Apr 11 01:40:55 mbx10 kernel: [50031643.417112] drbd mb09: Connection closed
> ---
> 
> This may continue with broken pipes, timeouts and a general inability to
> regain a consistent and up-to-date state.
> Though sometimes (hours later!) it may untwist itself w/o any manual help.
> 
> Whenever this happened, the systems were always extremely lightly loaded
> (no suspicious CPU usage), the block storage (MD RAID and underlying
> SSDs) was bored and the network/replication link in perfect health.
> Nothing relevant in the logs either, unlike the "allocation failures" seen
> years ago (and not since, with recent kernels/DRBD).
> In short, nothing obvious that would explain these timeouts.
> 
> However, due to my past encounters I already had vm/min_free_kbytes set to
> 1GB, and out of force of habit as well as past experience I dropped the
> pagecache and issued a compact_memory, which indeed resolved the DRBD issues.
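> 
> For reference, the exact knobs involved boil down to this (a sketch, with
> the values as set here):
> ---
> # sysctl -w vm.min_free_kbytes=1048576   # keep ~1GB free for kernel/atomic allocations
> # echo 1 > /proc/sys/vm/drop_caches      # 1 = drop the pagecache only, not slab
> # echo 1 > /proc/sys/vm/compact_memory   # trigger compaction on all NUMA nodes
> ---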
> 
> Last night I dug a bit deeper and ran "iostat -x" (example invocation below).
> Lo and behold, while any and all actual devices (MD, SSDs) were as bored as
> usual, the affected DRBD device was fluctuating between 20-100% utilization.
> Again, this is NOT the network link, as there are 2 DRBD resources (one
> primary on each node) and the other one was not affected at all.
> Dropping the pagecache (even w/o compacting memory) immediately resolved
> things and the DRBD utilization according to iostat went back to the
> normal 1-2%. 
> So we're obviously looking at a memory fragmentation/allocation issue.
> However, since no component actually logged a protest anywhere, I can't be
> certain whether this is within DRBD (having only 1 out of 2 resources
> affected hints at this), the ever-suspicious MM stack, or the network
> drivers (very unlikely IMHO).
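> 
> For completeness, the iostat check above boils down to something like this
> (device names purely illustrative, adjust to the local MD/SSD layout):
> ---
> # iostat -x /dev/drbd0 /dev/drbd1 /dev/md3 /dev/sda /dev/sdb 5
> ---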
> 
> If this sounds familiar, I'd appreciate any feedback.
> 
> FWIW, the users/data are being migrated to a new cluster with twice the RAM,
> running Jessie, a 4.9 kernel and DRBD 8.4.7; the above cluster will be
> upgraded to that level afterwards.
> 
> Regards,
> 
> Christian


-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/


