Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

While somewhat similar to the issues I raised 3 years ago in the 'recovery from "page allocation failure"' thread, this seems to be an entirely different beast under the hood.

Some facts first:
2 node Debian Wheezy cluster with a custom 4.1.6 kernel (the newest at build time) and DRBD 8.4.5. 128GB RAM, all "disks" are Intel DC SSD models.
The use case is a dovecot mailbox cluster with up to 65k IMAP sessions and thus processes per node; RAM usage is about 85GB for processes, 20GB for SLAB (dir_entry, inodes) and the rest page cache.

At times DRBD tends to have hiccups like this:
--- node a ---
Apr 11 01:40:55 mbx09 kernel: [50024550.377668] drbd mb09: sock was shut down by peer
Apr 11 01:40:55 mbx09 kernel: [50024550.377682] drbd mb09: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
Apr 11 01:40:55 mbx09 kernel: [50024550.377685] drbd mb09: short read (expected size 16)
Apr 11 01:40:55 mbx09 kernel: [50024550.377692] drbd mb09: asender terminated
Apr 11 01:40:55 mbx09 kernel: [50024550.377694] drbd mb09: Terminating drbd_a_mb09
--- node b ---
Apr 11 01:40:55 mbx10 kernel: [50031643.361001] drbd mb09: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Apr 11 01:40:55 mbx10 kernel: [50031643.361010] drbd mb09: connection_finish_peer_reqs() failed
Apr 11 01:40:55 mbx10 kernel: [50031643.361605] drbd mb09: asender terminated
Apr 11 01:40:55 mbx10 kernel: [50031643.361609] drbd mb09: Terminating drbd_a_mb09
Apr 11 01:40:55 mbx10 kernel: [50031643.417112] drbd mb09: Connection closed
---

This may continue with broken pipes, timeouts and a general inability to regain a consistent and up-to-date state, though sometimes (hours later!) it may untwist itself without any manual help.

Whenever this happened, the systems were always extremely lightly loaded (no suspicious CPU usage), the block storage (MD RAID and underlying SSDs) was bored and the network/replication link in perfect health. Nothing relevant in the logs either, unlike the "allocation failures" seen years ago (and not since, with recent kernels/DRBD). In short, nothing obvious that would explain these timeouts.

However, due to my past encounters, I already had vm/min_free_kbytes set to 1GB and, by force of habit as well as past experience, dropped the pagecache and issued a compact_memory, which indeed resolved the DRBD issues (a sketch of these steps is at the end of this mail).

Last night I dug a bit deeper and ran "iostat -x". Lo and behold, while any and all actual devices (MD, SSDs) were as bored as usual, the affected DRBD device was fluctuating between 20-100% utilization (a small monitoring sketch for this is also at the end). Again, this is NOT the network link, as there are 2 DRBD resources (one primary on each node) and the other one was not affected at all.

Dropping the pagecache (even without compacting memory) immediately resolved things and the DRBD utilization according to iostat went back to the normal 1-2%.

So we're obviously looking at a memory fragmentation/allocation issue, but since no component actually logged a protest anywhere I can't be certain whether this is within DRBD (having only 1 out of 2 resources affected hints at this), the ever suspicious MM stack or the network drivers (very unlikely IMHO).

If this sounds familiar, I'd appreciate any feedback.

FWIW, the users/data are being migrated to a new cluster with twice the RAM, Jessie, a 4.9 kernel and DRBD 8.4.7; the above cluster will be upgraded to that level afterwards.
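For completeness, here's roughly what the workaround amounts to: a minimal sketch of the /proc knobs mentioned above (Python only to keep it self-contained, the usual echo commands do the same; root required, and compact_memory assumes a kernel built with CONFIG_COMPACTION):

--- workaround sketch ---
#!/usr/bin/env python3
# Minimal sketch of the workaround described above: keep a large free
# memory reserve, flush dirty data, drop the page cache and ask the
# kernel to compact memory.
import os

def write_proc(path, value):
    # The VM tunables are plain text files under /proc/sys/vm.
    with open(path, "w") as f:
        f.write(value)

# 1GB reserve so atomic/higher-order allocations don't starve.
write_proc("/proc/sys/vm/min_free_kbytes", str(1024 * 1024))

# Flush dirty pages before dropping anything.
os.sync()

# 1 = drop the page cache only (3 would also drop dentries/inodes).
write_proc("/proc/sys/vm/drop_caches", "1")

# Explicitly compact memory to rebuild higher-order free pages.
write_proc("/proc/sys/vm/compact_memory", "1")
---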
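And since staring at iostat during these episodes gets tedious, a rough sketch of how the same %util figure can be derived from /proc/diskstats (the io_ticks field, which is what iostat -x uses); "drbd0" is only an assumption for the minor backing the mb09 resource:

--- util monitoring sketch ---
#!/usr/bin/env python3
# Rough sketch: compute %util for a block device from /proc/diskstats,
# the way iostat -x does (delta of io_ticks over wall clock time).
# "drbd0" is an assumption for the minor backing the mb09 resource.
import time

def io_ticks(dev):
    # Field 10 of the per-device stats (13th column overall) is the
    # number of milliseconds spent doing I/O.
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[12])
    raise ValueError("device %s not found" % dev)

dev, interval = "drbd0", 5.0
prev = io_ticks(dev)
while True:
    time.sleep(interval)
    cur = io_ticks(dev)
    print("%s util: %5.1f%%" % (dev, 100.0 * (cur - prev) / (interval * 1000.0)))
    prev = cur
---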
Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/