Hi everyone!

Sorry for the lengthy mail, but I think I have to include some background ...

We have been using DRBD8-mirrored LVM volumes as filesystems for virtualized Linux guests (formerly OpenVZ with their 2.6.32 kernel, now LXC with Debian kernel 4.4 to 4.6) for about 8 years. Currently this pattern runs on 4 pairs of 1U servers, similar but of different models, each featuring 1 or 2 Xeons, 64 to 128 GB of RAM, hardware RAID, and a direct eth1/eth1 connection reserved for DRBD.

So far we were pretty happy that DRBD allowed us not to worry about hardware problems triggering lengthy service interruptions. Thanks Linbit!

Starting last July, a very nasty problem has come up every now and then, 7 times so far, with no obvious pattern regarding hardware model or the like:

First, one virtualized guest, presumably during an I/O peak like rsync-over-ssh of a large directory, becomes unreachable and unusable. Simultaneously the system load (i.e. the first number in /proc/loadavg) starts to rise, slowly (about +1 every 3-5 minutes) but without limit, to 1000 (in words: one thousand) and more.

Even with load 1000 the host system and the other virtualized guests keep working relatively normally; there is no problem using the hosted websites or logging in via ssh. The load number obviously is somewhat "unreal", which might already be a hint.

However, it is
- impossible to disconnect the hanging drbd device
- impossible to kill related processes like drbdXX_submit, jbd2/drbdXX-8
- impossible to stop the affected virtual machine or any other one
- hence impossible to do a clean shutdown or reboot

The only way out is to press the reset button, either physically on site or virtually using BMC/KVM/IPMI services.

While I'm not entirely sure DRBD is to blame, 6 of 7 cases started with weird drbd-related messages in syslog.

Last case:

Sep 30 14:12:45 host14 kernel: [203230.540687] Oops: 0000 [#1] SMP
Sep 30 14:12:45 host14 kernel: [203230.541997] CPU: 0 PID: 4211 Comm: drbd_w_bs Tainted: G OE 4.6.0-0.bpo.1-amd64 #1 Debian 4.6.4-1~bpo8+1
Sep 30 14:12:45 host14 kernel: [203230.542186] RIP: 0010:[<ffffffff81320246>] [<ffffffff81320246>] memcpy_erms+0x6/0x10
Sep 30 14:12:45 host14 kernel: [203230.542344] RDX: 00000000000003b0 RSI: 0000000000000003 RDI: ffff88080a616040
Sep 30 14:12:45 host14 kernel: [203230.542619] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 30 14:12:45 host14 kernel: [203230.542863] 00004000000005b4 00000000000005b4 0000000000000a70 0000000000000a00
Sep 30 14:12:45 host14 kernel: [203230.543108] [<ffffffffc04fde49>] ? drbd_send+0xc9/0x1e0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.554230] [<ffffffffc04fbf50>] ? drbd_destroy_connection+0xf0/0xf0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.564960] [<ffffffff81099df0>] ? kthread_park+0x50/0x50
Sep 30 14:12:45 host14 kernel: [203230.584805] ---[ end trace 2335d6e97c28a203 ]---

This was the first time "SMP" came up; "drbd" was "featured" every time.

The full list of 7 cases with syslog excerpts can be found on the corresponding page in the Wiki of the "OpenSource side" of our company:
https://confluence.clazzes.org/x/CgC2

A subpage there also describes the Debian packages we built from the upstream DKMS module sources.

Feel free to request an account on clazzes.org by e-mailing me directly (self-registration was disabled after spam incidents).

Does our problem look familiar to anyone?

Our typical DRBD configs include:

# global
common {
  protocol C;
  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
    data-integrity-alg crc32c;
  }
  syncer {
    rate 20M;
    al-extents 257;
    verify-alg crc32c;
  }
}

# resource
disk {
  no-disk-flushes;
  no-md-flushes;
}

Any help would be appreciated.
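
A side note on the "unreal" load figure: on Linux the load average counts tasks in uninterruptible sleep (state D) in addition to runnable ones, and D-state tasks do not react to signals. That would be consistent with a load of 1000 on an otherwise responsive host and with processes that cannot be killed. The following is only a minimal sketch for collecting that information on an affected host, assuming a Linux /proc filesystem; the function names are made up for illustration and are not part of any DRBD tooling.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])


def parse_stat(text):
    """Return (pid, comm, state) from the content of /proc/<pid>/stat.

    comm is wrapped in parentheses and may itself contain spaces or
    parentheses, so split on the last ')' instead of on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    return int(pid_str), comm, right.split()[0]


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process whose main thread is in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo' or 'self'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as fh:
        print("load averages:", parse_loadavg(fh.read()))
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the list of D-state tasks grows in step with the load and is dominated by drbd and jbd2 threads, that would point at blocked I/O on the DRBD device rather than at CPU pressure. It would not by itself tell which layer is at fault.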

Regards, Christoph

--
Christoph Lechleitner
Management
------------------------------------------------------------------------
ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
FN 365826f | Handelsgericht Innsbruck | Mobile: +43 676 3674710
Mail: christoph.lechleitner at iteg.at | Web: http://www.iteg.at/
------------------------------------------------------------------------