[DRBD-user] Fwd: kernel BUG on new server with drbd 9.0.5

Fri Nov 25 16:22:45 CET 2016

On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
>  Hello,
> 
> I have had drbd 8.x in production for many years, without problems.
> I have a single primary setup, SAS disks are in hardware RAID5,
> partitions are assigned to drbd.
> On top of drbd, there are virtual machines, which use drbd resources
> as raw disks.
> 
> I am migrating from an old pair of servers to a new pair of servers and to
> drbd 9.0.5. It went well until last midnight.

If you have a two-node setup, stay with DRBD 8.4:
you don't gain anything from the new features of DRBD 9,
but, as you found out, you may still be hit by its regressions.

> Resource R5 was doing initial synchronization between the two new
> servers, both running drbd 9.0.5.
> Resource R4 probably received some read/write load on the primary
> node, as it ran nightly backup.
> The bug came on the server that was the target of the synchronisation,
> and drbd secondary.
> 
> OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
> drbd 9.0.5 myself from the tarball.
> 
> I'd like to know if this is a drbd bug, or a hardware issue?

It likely is something in DRBD 9.

> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776]
	drbd r4/0 drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags)
		FAILED in drbd_try_rs_begin_io

> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut here ]------------
> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!

I may have worked around this particular bug already,
so it would no longer kill the box in the same situation,
but the underlying issue, why it would run into this at all, is still unclear.

Anyways, this is a kernel thread hitting a BUG() while holding a
spinlock, with IRQs disabled.

All further misbehavior follows from that.

> This task (drbd_a_r4) stayed with us until reboot, and kept 1 core at
> 100% CPU Load:

That is another thread trying to grab the spinlock,
which was never released by the kernel thread that hit the BUG() and was destroyed.

> The secondary was available on the network, but it was sluggish to work with.
> 
> When I tried to reboot next morning, the reboot stalled, it couldn't
> deconfigure the network, so I had to reboot using the power button...

Yes, that would now be pretty much expected.
I'd probably have done a
	for x in s u b ; do echo $x > /proc/sysrq-trigger ; sleep 1; done
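For reference, the same loop can be spelled out as a small function, with each SysRq key commented. This is only a sketch: the function name and the optional path argument are hypothetical, added so the loop can be dry-run against a scratch file instead of the real trigger.

```shell
#!/bin/sh
# Sketch of the emergency reboot loop, one SysRq key per second.
# emergency_reboot and its optional argument are illustrative names,
# not part of any existing tool.

emergency_reboot() {
    # $1: file to write the keys to; defaults to the real SysRq trigger.
    trigger="${1:-/proc/sysrq-trigger}"

    # s = sync dirty data to disk
    # u = remount all filesystems read-only
    # b = reboot immediately, without unmounting or further syncing
    for key in s u b; do
        echo "$key" > "$trigger"
        sleep 1
    done
}

# Real use, as root (this reboots the machine):
#   emergency_reboot
```

Writing to /proc/sysrq-trigger as root works regardless of the kernel.sysrq setting, which only restricts the keyboard combination.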
I'd even recommend setting these sysctls, so that any server reboots itself
if it triggers a BUG(): kernel.panic_on_oops = 1, kernel.panic = 30
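One way to make those two settings persistent could look like the sketch below. The helper name and the file name under /etc/sysctl.d/ are assumptions; any sysctl configuration file your distribution reads at boot would do.

```shell
#!/bin/sh
# Sketch: write the two panic-related sysctls to a configuration fragment.
# write_panic_sysctls is a hypothetical helper name.

write_panic_sysctls() {
    # $1: path of the sysctl fragment to write
    cat > "$1" <<'EOF'
# Turn an oops (which includes a BUG()) into a kernel panic
kernel.panic_on_oops = 1
# Reboot automatically 30 seconds after a panic
kernel.panic = 30
EOF
}

# Example, as root (the file name is an assumption):
#   write_panic_sysctls /etc/sysctl.d/90-panic-on-oops.conf
#   sysctl -p /etc/sysctl.d/90-panic-on-oops.conf
```

With both set, a box that hits a BUG() panics instead of limping along with a stuck spinlock, and comes back by itself after 30 seconds.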

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed