Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,

Any insights on the issue below?

As mentioned, I now have a 2nd cluster up and running with a 4.9 kernel
and DRBD 8.4.7. After about 30 days of operation this reared its ugly
head again:
---
[3526901.689492] block drbd0: Remote failed to finish a request within 60444ms > ko-count (10) * timeout (60 * 0.1s)
[3526901.689516] drbd mb11: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
---
Thankfully this recovered by itself, but not without delaying things
noticeably from dovecot's perspective.
Again, this happened at 07:43 local time, with very little activity on
either node.

Before somebody suggests hugepage defrag: not in use here.

The node which failed to respond in time once again had pretty badly
fragmented memory:
---
# cat /proc/buddyinfo
Node 0, zone      DMA       0      1      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   40598  28249  10071    601     24      0      0      0      0      0      0
Node 0, zone   Normal  374754  67419    456     85      0      0      0      0      0      0      0
Node 1, zone   Normal 1069436 584167 151282  13749   1570      1      0      0      0      0      0

# echo 1 > /proc/sys/vm/compact_memory

# cat /proc/buddyinfo
Node 0, zone      DMA       0      1      0      0      2      1      1      0      1      1      3
Node 0, zone    DMA32   11856   7148   3334   1998    895    644    382    184     26      0      0
Node 0, zone   Normal  269437  27100   5996   4188   2852   1422    513    155     46     66     35
Node 1, zone   Normal   52310  62435  22325   7498   5673   4410   3096   1978   1085    949   1152
---
I simply can't believe or accept that manually dropping caches and
compacting memory is required to run a stable DRBD cluster in this day
and age.
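For the record, the manual band-aid amounts to roughly the following (a
sketch of what gets run by hand; the 1GB min_free_kbytes is the value I
already keep set permanently, see the quoted mail below):
---
# keep a larger reserve of free pages for atomic/higher-order allocations
# (1GB, expressed in kbytes)
sysctl -w vm.min_free_kbytes=1048576

# drop the pagecache only (1 = pagecache; dentries/inodes are left alone)
echo 1 > /proc/sys/vm/drop_caches

# ask the kernel to compact memory
echo 1 > /proc/sys/vm/compact_memory

# verify that fragmentation actually improved
cat /proc/buddyinfo
---
The drop_caches/compact_memory pair only gets run when DRBD starts timing
out.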
Christian

On Tue, 11 Apr 2017 11:38:07 +0900 Christian Balzer wrote:

> Hello,
> 
> While somewhat similar to the issues I raised 3 years ago in the
> 'recovery from "page allocation failure"' thread, this seems to be an
> entirely different beast under the hood.
> 
> Some facts first:
> 2 node Debian Wheezy cluster with a custom 4.1.6 kernel (the newest at
> build time) and DRBD 8.4.5.
> 128GB RAM, all "disks" are Intel DC SSD models.
> The use case is a dovecot mailbox cluster, with up to 65k IMAP sessions
> and thus processes per node; RAM usage is about 85GB for processes, 20GB
> for SLAB (dir_entry, inodes) and the rest page cache.
> 
> At times DRBD tends to have hiccups like this:
> --- node a ---
> Apr 11 01:40:55 mbx09 kernel: [50024550.377668] drbd mb09: sock was shut down by peer
> Apr 11 01:40:55 mbx09 kernel: [50024550.377682] drbd mb09: peer( Secondary -> Unknown ) conn( Connected -> BrokenPipe ) pdsk( UpToDate -> DUnknown )
> Apr 11 01:40:55 mbx09 kernel: [50024550.377685] drbd mb09: short read (expected size 16)
> Apr 11 01:40:55 mbx09 kernel: [50024550.377692] drbd mb09: asender terminated
> Apr 11 01:40:55 mbx09 kernel: [50024550.377694] drbd mb09: Terminating drbd_a_mb09
> --- node b ---
> Apr 11 01:40:55 mbx10 kernel: [50031643.361001] drbd mb09: peer( Primary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> Apr 11 01:40:55 mbx10 kernel: [50031643.361010] drbd mb09: connection_finish_peer_reqs() failed
> Apr 11 01:40:55 mbx10 kernel: [50031643.361605] drbd mb09: asender terminated
> Apr 11 01:40:55 mbx10 kernel: [50031643.361609] drbd mb09: Terminating drbd_a_mb09
> Apr 11 01:40:55 mbx10 kernel: [50031643.417112] drbd mb09: Connection closed
> ---
> 
> This may continue with broken pipes, timeouts and a general inability to
> regain a consistent and up-to-date state.
> Though sometimes (hours later!) it may untwist itself w/o any manual help.
> 
> When this is happening, the systems were always extremely lightly loaded
> (no suspicious CPU usage), the block storage (MD RAID and underlying
> SSDs) was bored and the network/replication link in perfect health.
> Nothing relevant in the logs either, unlike the "allocation failures" seen
> years ago (and not since, with recent kernels/DRBD).
> In short, nothing obvious that would explain these timeouts.
> 
> However, due to my past encounters I already had vm/min_free_kbytes set to
> 1GB, and by force of habit as well as experience I dropped the pagecache
> and issued a compact_memory, which indeed resolved the DRBD issues.
> 
> Last night I dug a bit deeper and ran "iostat -x". Lo and behold, while
> any and all actual devices (MD, SSDs) were as bored as usual, the
> affected DRBD device was fluctuating between 20-100% utilization.
> Again, this is NOT the network link, as there are 2 DRBD resources (one
> primary on each node) and the other one was not affected at all.
> Dropping the pagecache (even w/o compacting memory) immediately resolved
> things, and the DRBD utilization according to iostat went back to the
> normal 1-2%.
> So we're obviously looking at a memory fragmentation/allocation issue;
> however, since no component actually logged a protest anywhere, I can't be
> certain whether this is within DRBD (having only 1 out of 2 resources
> affected hints at this), the ever-suspicious MM stack, or the network
> drivers (very unlikely IMHO).
> 
> If this sounds familiar, I'd appreciate any feedback.
> 
> FWIW, the users/data are being migrated to a new cluster with twice the
> RAM and Jessie, a 4.9 kernel and DRBD 8.4.7; the above cluster will be
> upgraded to that level afterwards.
> 
> Regards,
> 
> Christian

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com        Global OnLine Japan/Rakuten Communications
http://www.gol.com/