[DRBD-user] Fwd: kernel BUG on new server with drbd 9.0.5

Laszlo Fiat laszlo.fiat at gmail.com
Fri Dec 2 14:27:08 CET 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Dec 1, 2016 at 5:45 PM, Lars Ellenberg
<lars.ellenberg at linbit.com> wrote:
> On Fri, Nov 25, 2016 at 04:22:45PM +0100, Lars Ellenberg wrote:
>> On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
> Laszlo,
> I finally found time to look at this in detail again and:
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776]
>>       drbd r4/0 drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags)
>>               FAILED in drbd_try_rs_begin_io
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut here ]------------
>> > Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
>> I may have worked around this particular bug already,
> Nope, this one is not yet understood.
> We know where it crashes,
> but I'd like to understand how to provoke it.
> I'd like to know what happened *before* the first ASSERTION and BUG,
> what events lead up to this.
> Do you still happen to have the logs,
> or can recall what you did to "make it crash" in this way?

I migrated from an old pair of servers (drbd8) to a new pair of servers.
Only the two new servers were involved in replication at the time
of the crash. Both of them were running drbd 9.0.5 at that time.
All resources were primary on hubud-s-apl0001, and all were secondary
on hufot-s-apl0004.

Resource R5 was doing initial synchronization between the two new
servers, SyncSource was hubud-s-apl0001, SyncTarget was hufot-s-apl0004.
Resource R4 probably received some read/write load on the primary
node hubud-s-apl0001, as it ran nightly backup.
The bug occurred on hufot-s-apl0004, the server that was the target of the
synchronisation and the drbd secondary.

OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
drbd 9.0.5 myself from the tarball.

The configuration is a bit unusual: global_common.conf contains parameters
optimized for WAN replication, with pull-ahead on congestion and
variable-rate resync, but without a drbd-proxy.
At the time of the crash, however, both nodes were on the same LAN with
Gigabit Ethernet connections, and each resource had its own disk section
to speed up replication on the LAN.
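For illustration, such a setup typically combines congestion pull-ahead in
the net section with the variable-rate resync controller in the disk
section. The option names below are real DRBD settings, but the values and
the resource name are made-up examples, not my actual configuration:

```
# global_common.conf -- illustrative values only, not the original config
common {
    net {
        protocol A;               # asynchronous replication, typical for WAN
        on-congestion pull-ahead; # go Ahead (mark out of sync) instead of blocking
        congestion-fill 1G;       # send-buffer fill level that triggers pull-ahead
        congestion-extents 500;   # activity-log extents that trigger pull-ahead
    }
    disk {
        c-plan-ahead 20;          # >0 enables the variable-rate resync controller
        c-fill-target 100k;
        c-max-rate 4M;            # roughly matches a 30 Mbps WAN link
        c-min-rate 1M;
    }
}

# per-resource disk section overriding the defaults while both nodes
# sit on the same Gigabit LAN (fragment only; device/on sections omitted)
resource r5 {
    disk {
        c-max-rate 110M;          # close to Gigabit Ethernet line speed
    }
}
```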

This was a temporary setup to replicate some TBytes of data to the
secondary, before we transport it to the other location that is connected
via a WAN link (30 Mbps).

The servers are now running drbd 8.4.9; they are at their final
location, in production.

The logs from both nodes and the configuration are available here:


Thank you for looking into this.

Best regards,

Laszlo Fiat
