[DRBD-user] Fwd: kernel BUG on new server with drbd 9.0.5

Fri Dec 2 14:27:08 CET 2016

Hello,

On Thu, Dec 1, 2016 at 5:45 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Fri, Nov 25, 2016 at 04:22:45PM +0100, Lars Ellenberg wrote:
>> On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
>
> Laszlo,
> I finally found time to look at this in detail again and:
>
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776]
>>       drbd r4/0 drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags)
>>               FAILED in drbd_try_rs_begin_io
>>
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut here ]------------
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
>>
>> I may have worked around this particular bug already,
>
> Nope, this one is not yet understood.
>
> We know where it crashes,
> but I'd like to understand how to provoke it.
>
> I'd like to know what happened *before* the first ASSERTION and BUG,
> what events lead up to this.
>
> Do you still happen to have the logs,

Yes.

> or can recall what you did to "make it crash" in this way?

I migrated from an old pair of servers (drbd8) to a new pair of servers.
Only the two new servers were involved in replication at the time
of the crash. Both of them were running drbd 9.0.5 at that time.
All resources were primary on hubud-s-apl0001, and all were secondary
on hufot-s-apl0004.

Resource R5 was doing initial synchronization between the two new
servers, SyncSource was hubud-s-apl0001, SyncTarget was hufot-s-apl0004.
Resource R4 probably received some read/write load on the primary
node hubud-s-apl0001, as it ran the nightly backup.
The bug occurred on the server that was the target of the
synchronisation and drbd secondary, hufot-s-apl0004.

OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
drbd 9.0.5 myself from the tarball.

The configuration is a bit weird: global_common.conf contains parameters
optimized for WAN replication, with pull-ahead and variable-rate
synchronization, without a drbd-proxy.
However, at the time of the crash both nodes were on the same LAN with
Gigabit Ethernet connections, and a per-resource disk section was
defined to speed up replication on the LAN.

This was a temporary setup to replicate some TBytes of data to the
secondary before we transported it to the other location, which is
connected via a WAN link (30 Mbps).

The servers are now running drbd 8.4.9; they are at their final
location, in production.

The logs from both nodes and the configuration are available here:

hXXps://drive.google.com/open?id=0Bwc15pKrQgubbi1tTFAzSUk4Vzg

Thank you for looking into this.

Best regards,

Laszlo Fiat