[DRBD-user] drbd connection dying badly, ever-rising load, requiring hard machine reset

Fri Sep 30 16:08:59 CEST 2016

Hi everyone!

Sorry for my lengthy mail, but I think I have to include some background ...

We have been using DRBD8-mirrored LVM volumes as filesystems for
virtualized Linux guests (formerly OpenVZ with their 2.6.32 kernel, now
LXC with Debian kernels 4.4 to 4.6) for about 8 years.
Currently this setup runs on 4 pairs of 1U servers, similar but of
different models, each featuring 1 or 2 Xeons, 64 to 128 GB of RAM,
hardware RAID, and a direct eth1/eth1 connection reserved for DRBD.

So far we have been pretty happy that DRBD allowed us not to worry about
hardware problems triggering lengthy service interruptions. Thanks Linbit!

Starting last July a very nasty problem has come up every now and then,
7 times so far, with no obvious pattern regarding hardware model or the
like:

First, one virtualized guest becomes unreachable and unusable,
presumably during an I/O peak like an rsync-over-ssh of a large
directory.

Simultaneously the system load (i.e. the first number in /proc/loadavg)
starts to rise, slowly (about +1 every 3-5 minutes) but forever, to 1000
(in words: one thousand) and more.

Even with load 1000 the host system and the other virtualized guests
keep working relatively normally; the hosted websites remain usable and
ssh logins work without problems.
The load number obviously is somewhat "unreal", which might already be a
hint.
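
A note on why the number can look "unreal": on Linux the load average
counts not only runnable tasks but also tasks in uninterruptible sleep
(state D), so tasks stuck waiting on I/O push it up without using any
CPU. A minimal sketch for listing such tasks, assuming a Linux /proc
filesystem; the helper names parse_state and uninterruptible_tasks are
made up for illustration, and the scan only sees entries directly under
/proc, not every thread:

```python
import os


def parse_state(stat_text):
    """Return (comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    start = stat_text.index("(")
    end = stat_text.rindex(")")
    comm = stat_text[start + 1:end]
    state = stat_text[end + 1:].split()[0]
    return comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                comm, state = parse_state(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile, or stat was unreadable
        if state == "D":
            found.append((int(entry), comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

If the count of D-state tasks tracks the load number, the load is
probably made of blocked tasks rather than real CPU demand.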

However, it's
- impossible to disconnect the hanging drbd device
- impossible to kill related processes like drbdXX_submit, jbd2/drbdXX-8
- impossible to stop the fallen one or any other virtual machine
- hence impossible to do a clean shutdown or reboot

The only way out is to press the reset button, either physically on site
or virtually using BMC/KVM/IPMI services.

While I'm not entirely sure DRBD is to blame, 6 of 7 cases started with
weird DRBD-related messages in syslog.

Last case:
Sep 30 14:12:45 host14 kernel: [203230.540687] Oops: 0000 [#1] SMP
Sep 30 14:12:45 host14 kernel: [203230.541997] CPU: 0 PID: 4211 Comm:
drbd_w_bs Tainted: G           OE   4.6.0-0.bpo.1-amd64 #1 Debian
4.6.4-1~bpo8+1
Sep 30 14:12:45 host14 kernel: [203230.542186] RIP:
0010:[<ffffffff81320246>]  [<ffffffff81320246>] memcpy_erms+0x6/0x10
Sep 30 14:12:45 host14 kernel: [203230.542344] RDX: 00000000000003b0
RSI: 0000000000000003 RDI: ffff88080a616040
Sep 30 14:12:45 host14 kernel: [203230.542619] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033
Sep 30 14:12:45 host14 kernel: [203230.542863]  00004000000005b4
00000000000005b4 0000000000000a70 0000000000000a00
Sep 30 14:12:45 host14 kernel: [203230.543108]  [<ffffffffc04fde49>] ?
drbd_send+0xc9/0x1e0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.554230]  [<ffffffffc04fbf50>] ?
drbd_destroy_connection+0xf0/0xf0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.564960]  [<ffffffff81099df0>] ?
kthread_park+0x50/0x50
Sep 30 14:12:45 host14 kernel: [203230.584805] ---[ end trace
2335d6e97c28a203 ]---

This was the first time "SMP" came up, "drbd" always was "featured".
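
To compare the cases, it may help to pull the symbol names and module
tags out of such excerpts mechanically. A rough sketch, assuming the
frame format shown in the excerpt above (other kernel versions may print
frames differently); FRAME_RE and frames are illustrative names, and
SAMPLE is copied from the log above, line wraps included:

```python
import re

# Matches "[<addr>] symbol+0xOFF/0xLEN", optionally preceded by "? " and
# followed by " [module]". \s also matches the mail's line wraps.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

SAMPLE = """\
Sep 30 14:12:45 host14 kernel: [203230.542186] RIP:
0010:[<ffffffff81320246>]  [<ffffffff81320246>] memcpy_erms+0x6/0x10
Sep 30 14:12:45 host14 kernel: [203230.543108]  [<ffffffffc04fde49>] ?
drbd_send+0xc9/0x1e0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.554230]  [<ffffffffc04fbf50>] ?
drbd_destroy_connection+0xf0/0xf0 [drbd]
Sep 30 14:12:45 host14 kernel: [203230.564960]  [<ffffffff81099df0>] ?
kthread_park+0x50/0x50
"""


def frames(log_text):
    """Return (symbol, module) pairs; module is None for built-in code."""
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(log_text)]


if __name__ == "__main__":
    for symbol, module in frames(SAMPLE):
        print(symbol, module or "-")
```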

The full list of 7 cases with syslog excerpts can be found on a
dedicated page in the wiki of the "OpenSource side" of our company:
https://confluence.clazzes.org/x/CgC2

A subpage there also describes the Debian packages we built from the
upstream DKMS module sources.

Feel free to request an account on clazzes.org by e-mailing me directly
(self-registration was disabled after spam incidents).

Does our problem look familiar to anyone?

Our typical DRBD configs include:

# global
common {
  protocol C;

  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
    data-integrity-alg crc32c;
  }
  syncer {
    rate 20M;
    al-extents 257;
    verify-alg crc32c;
  }
}

# resource
disk {
    no-disk-flushes;
    no-md-flushes;
}

Any help would be appreciated.

Regards, Christoph

-- 

Christoph Lechleitner

Geschäftsführung

------------------------------------------------------------------------
ITEG IT-Engineers GmbH | Conradstr. 5, A-6020 Innsbruck
FN 365826f | Handelsgericht Innsbruck | Mobiltelefon: +43 676 3674710
Mail: christoph.lechleitner at iteg.at | Web: http://www.iteg.at/
------------------------------------------------------------------------