[DRBD-user] DRBD, RHEL 7.5, kernel panic in nla_parse, or "Failure: (126) UnknownMandatoryTag"

Lars Ellenberg lars.ellenberg at linbit.com
Fri Apr 20 00:16:16 CEST 2018


A quick heads up for all people using DRBD on RHEL 7.

TL;DR:
If you run a RHEL 7.5 kernel (3.10.0-862.el7.x86_64), make sure you use a
DRBD module compiled against matching kernel headers. DO NOT use a DRBD
module compiled against 3.10.0-693* (7.4) or older.
Otherwise, best case: the configuration requests fail with an error,
worst case: kernel panic on first "drbdadm up".

More details:

The new RHEL 7.5 kernel broke kABI in a way that is
not detected by the usual means (symbol versioning checksums).

The "weak-modules" magic will thus think an older module
would be kABI compatible, and symlink it in place,
while in reality it is binary incompatible.

The incompatibility in this case is in the configuration interface
we use ("netlink", more specifically "genetlink").

If you had a module compiled against a 7.4 kernel installed
(kmod-drbd ... 3.10.0-693 something), it would now be loaded
into the 7.5 kernel (3.10.0-862 something).
The module would call into the kernel to "parse" and "validate"
the configuration requests sent by drbdsetup,
but the kernel functions doing so would now interpret the
data handed over by the module in a binary incompatible way.

If you are lucky, they just "cannot parse" it,
and reject the drbdsetup requests with some strange error message:

-----------------------------------------------------------------
[root@alice7 drbd]# drbdsetup new-resource r0 0 --auto-promote=no
r0: Failure: (126) UnknownMandatoryTag
additional info from kernel:
invalid attribute value
-----------------------------------------------------------------

If you are unlucky, parts of the "drbdadm up" will succeed,
and then cause a kernel panic once it reaches the "new-peer" stage:

-----------------------------------------------------------------
[root@alice7 drbd]# drbdsetup new-resource r0 0
[root@alice7 drbd]# drbdsetup new-peer r0 1 --_name=bob7
...
kernel BUG at lib/nlattr.c:66!
RIP: 0010:[<ffffffff86176ae0>]  [<ffffffff86176ae0>] validate_nla+0x230/0x240
...
Call Trace:
 [<ffffffff86176c26>] nla_parse+0xb6/0x120
 [<ffffffffc07c2173>] drbd_nla_parse_nested+0x43/0x50 [drbd]
 [<ffffffffc07a54b7>] __net_conf_from_attrs.isra.59+0x37/0x940 [drbd]
 [<ffffffffc07b315d>] adm_new_connection+0x9d/0x9c0 [drbd]
...
------------------------------------------------------------------

If you upgrade to a RHEL 7.5 kernel,
please make sure you use a DRBD kernel module compiled against
a RHEL 7.5 kernel-devel package.

Kernel module packages provided by LINBIT include the version of the
kernel they were built for in the package version string:
kmod-drbd-8.4.10_3.10.0_862-1.el7.x86_64.rpm
kmod-drbd-9.0.13_3.10.0_862-1.el7.x86_64.rpm

Starting with 9.0.13 and 8.4.11, we have an additional runtime check for
this particular incompatibility.  If you try to load the module provided
by kmod-drbd-9.0.13_3.10.0_693.21.1-1.el7.x86_64.rpm into the 3.10.0-862
kernel, it now refuses to load instead of misbehaving later:

-----------------------------------------------------------------
[root@alice7 drbd]# uname -r
3.10.0-862.el7.x86_64
[root@alice7 drbd]# modinfo drbd
...
vermagic: 3.10.0-693.21.1.el7.x86_64 SMP mod_unload modversions
...
     /lib/modules/3.10.0-862.el7.x86_64/weak-updates/drbd.ko
  -> /lib/modules/3.10.0-693.21.1.el7.x86_64/updates/drbd.ko

[root@alice7 drbd]# modprobe drbd
modprobe: ERROR: could not insert 'drbd': Invalid argument

[root@alice7 drbd]# dmesg | tail
...
 drbd: kernel disagrees about the layout of struct nla_policy (12)
 drbd: kABI breakage detected! module compiled for: 3.10.0-693.21.1.el7.x86_64
-----------------------------------------------------------------

Hope this helps to avoid a few spurious kernel panics
and WTF moments out there.



For those interested, the change was a backport that extended
struct nla_policy {
        u16             type;
        u16             len;
        RH_KABI_EXTEND(void *validation_data)
};

To parse and validate (generic) netlink packets,
an *array* of struct nla_policy elements is referenced.
RH_KABI_EXTEND() annotates the change, but also "hides"
this incompatible change
from the symbol version checksum magic.

The old module presents an array of
struct nla_policy { u16 type; u16 len; } policy[] = { { ... }, { ... } }
The new kernel expects each array element to be four times that size
(16 instead of 4 bytes on x86_64, counting alignment padding),
and accesses the array out of bounds, which results in "undefined" behavior.


Cheers,

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R&D, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT

