Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello Lars!

> If you have a two node setup, stay with DRBD 8.4,
> you don't gain anything from the new features of DRBD 9,

I thought DRBD 9 is much faster than DRBD 8.4? At least I remember someone
mentioning this in a post somewhere.

> but, as you found out, still may be hit by the regressions.

May I ask how stable you think DRBD 9 is compared to DRBD 8.4? As a developer
you normally have a good feeling for this. I really don't want to switch my
cluster back to 8.4 if there isn't a good reason for it.

> I may have worked around this particular bug already,
> so it would no longer kill the box in the same situation,
> but the underlying issue, why it even would run into this, is unclear.

Was this commit
http://git.drbd.org/drbd-9.0.git/commitdiff/a67cbb3429858b2e5faeb78dfc96820b516b2952 ?

I am a SW developer too, and I know the bugs I "implemented" in the past.
So may I ask whether you checked the whole driver source for similar
situations, where an error check jumps out of the normal control flow while
still holding a spinlock or mutex? Whenever I found such a bug in my drivers,
it rang a bell and I went looking for other candidates. Sometimes I found a
hidden "feature" that way.

BR, Jasmin

****************************************************************************

On 11/25/2016 04:22 PM, Lars Ellenberg wrote:
> On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
>> Hello,
>>
>> I have drbd 8.x in production for many years, without problems.
>> I have a single primary setup, SAS disks are in hardware RAID5,
>> partitions are assigned to drbd.
>> On top of drbd, there are virtual machines, which use drbd resources
>> as raw disks.
>>
>> I migrate from an old pair of servers to a new pair of servers and to
>> drbd 9.0.5. It went well until last midnight.
>
> If you have a two node setup, stay with DRBD 8.4,
> you don't gain anything from the new features of DRBD 9,
> but, as you found out, still may be hit by the regressions.
>
>> Resource R5 was doing initial synchronization between the two new
>> servers, both running drbd 9.0.5.
>> Resource R4 probably received some read/write load on the primary
>> node, as it ran nightly backup.
>> The bug came on the server that was the target of the synchronisation,
>> and drbd secondary.
>>
>> OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
>> drbd 9.0.5 myself from the tarball.
>>
>> I'd like to know if this is a drbd bug, or a hardware issue?
>
> It likely is something in DRBD 9.
>
>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776] drbd r4/0
>> drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags) FAILED
>> in drbd_try_rs_begin_io
>>
>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut here ]------------
>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
>
> I may have worked around this particular bug already,
> so it would no longer kill the box in the same situation,
> but the underlying issue, why it even would run into this, is unclear.
>
> Anyways, this is a kernel thread hitting a BUG() while holding a
> spinlock, with irq disabled.
>
> Any further misbehavior is coming from there.
>
>> This task (drbd_a_r4) stayed with us until reboot, and kept 1 core at
>> 100% CPU load:
>
> another thread trying to grab the spinlock,
> which was never released by the BUG()ed and destroyed kernel thread.
>
>> The secondary was available on the network, but it was sluggish to work with.
>>
>> When I tried to reboot next morning, the reboot stalled, it couldn't
>> deconfigure the network, so I had to reboot using the power button...
>
> Yes, that would now be pretty much expected.
>
> I'd probably have done a
>   for x in s u b ; do echo $x > /proc/sysrq-trigger ; sleep 1; done
>
> I'd even recommend to set these sysctls to have any server reboot itself
> if it triggers a BUG().
>   kernel.panic_on_oops = 1, kernel.panic = 30
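
P.S.: To make the pattern concrete, this is roughly what I mean by an error
check that jumps out of the normal control flow while still holding a
spinlock. It is a schematic sketch only, not taken from the DRBD source;
'struct item', sanity_check() and do_update() are made-up names.

    /* Schematic sketch only -- not taken from the DRBD source.
     * 'struct item', sanity_check() and do_update() are made-up names.
     */
    #include <linux/spinlock.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    struct item;                                  /* hypothetical type */
    extern bool sanity_check(struct item *it);    /* hypothetical helper */
    extern void do_update(struct item *it);       /* hypothetical helper */

    static DEFINE_SPINLOCK(lock);

    /* Broken: the error check returns while the spinlock is still held.
     * The next thread that tries to take 'lock' spins forever, one core
     * at 100%, which is exactly the symptom described above.
     */
    static int broken_update(struct item *it)
    {
        unsigned long flags;

        spin_lock_irqsave(&lock, flags);

        if (!sanity_check(it))
            return -EINVAL;        /* <-- leaves with the lock held */

        do_update(it);

        spin_unlock_irqrestore(&lock, flags);
        return 0;
    }

    /* Fixed: every exit path goes through the unlock. */
    static int fixed_update(struct item *it)
    {
        unsigned long flags;
        int err = 0;

        spin_lock_irqsave(&lock, flags);

        if (!sanity_check(it)) {
            err = -EINVAL;
            goto out;
        }

        do_update(it);
    out:
        spin_unlock_irqrestore(&lock, flags);
        return err;
    }

The BUG() case you describe ends up in the same place: the lock owner never
reaches its unlock, and whoever grabs the lock next spins with interrupts off.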