Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello Lars!

> http://git.drbd.org/drbd-9.0.git/commitdiff/a67cbb3429858b2e5faeb78dfc96820b516b2952

I just reviewed your commit, and I have a suggestion to omit the "have_mutex"
variable and the two "if (have_mutex)" checks, similar to the
"disconnect_rcu_unlock" label in the function "receive_protocol": simply add
two labels, "abort_unlock" and "retry_unlock", before the existing "abort" and
"retry" labels. At both new labels add
"mutex_unlock(&connection->mutex[DATA_STREAM]);" and then fall through to the
"abort" or "retry" code, respectively. Then, for the goto statements between
the "mutex_lock" and "mutex_unlock" calls (five places), use the new labels.

BR,
   Jasmin

****************************************************************************

On 11/27/2016 01:29 AM, Jasmin J. wrote:
> Hello Lars!
>
>> If you have a two node setup, stay with DRBD 8.4,
>> you don't gain anything from the new features of DRBD 9,
> I thought DRBD 9 is much faster than DRBD 8.4? At least I remember someone
> mentioning this in a post somewhere.
>
>> but, as you found out, still may be hit by the regressions.
> May I ask how stable you think DRBD 9 is compared to DRBD 8.4?
> As a developer you normally have a good feeling for this.
>
> I really don't want to switch my cluster back to 8.4 if there isn't a good
> reason to.
>
>> I may have worked around this particular bug already,
>> so it would no longer kill the box in the same situation,
>> but the underlying issue, why it even would run into this, is unclear.
> Was this the commit?
>
> http://git.drbd.org/drbd-9.0.git/commitdiff/a67cbb3429858b2e5faeb78dfc96820b516b2952
>
> I am a SW developer too, and I know the bugs I "implemented" in the past. So
> may I ask whether you checked the whole driver source for similar situations,
> where an error check jumps out of the normal control flow while still holding
> a spinlock or mutex?
> Whenever I found such a bug in my own drivers, it rang a bell and I went
> checking other candidates.
> Sometimes I found a hidden "feature" this way.
>
> BR,
>    Jasmin
>
> ****************************************************************************
>
> On 11/25/2016 04:22 PM, Lars Ellenberg wrote:
>> On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
>>> Hello,
>>>
>>> I have had drbd 8.x in production for many years, without problems.
>>> I have a single-primary setup; SAS disks are in hardware RAID5, and
>>> partitions are assigned to drbd.
>>> On top of drbd there are virtual machines, which use drbd resources
>>> as raw disks.
>>>
>>> I am migrating from an old pair of servers to a new pair of servers,
>>> and to drbd 9.0.5. It went well until last midnight.
>>
>> If you have a two node setup, stay with DRBD 8.4,
>> you don't gain anything from the new features of DRBD 9,
>> but, as you found out, still may be hit by the regressions.
>>
>>> Resource R5 was doing the initial synchronization between the two new
>>> servers, both running drbd 9.0.5.
>>> Resource R4 probably received some read/write load on the primary
>>> node, as it ran the nightly backup.
>>> The bug occurred on the server that was the target of the synchronisation,
>>> the drbd secondary.
>>>
>>> OS: Debian Jessie (stable) with Debian's stock kernel; I compiled
>>> drbd 9.0.5 myself from the tarball.
>>>
>>> I'd like to know if this is a drbd bug, or a hardware issue?
>>
>> It likely is something in DRBD 9.
>>
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776] drbd r4/0 drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags) FAILED in drbd_try_rs_begin_io
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut here ]------------
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
>>
>> I may have worked around this particular bug already,
>> so it would no longer kill the box in the same situation,
>> but the underlying issue, why it even would run into this, is unclear.
>>
>> Anyways, this is a kernel thread hitting a BUG() while holding a
>> spinlock, with irqs disabled.
>>
>> Any further misbehavior follows from there.
>>
>>> This task (drbd_a_r4) stayed with us until reboot, and kept one core at
>>> 100% CPU load:
>>
>> Another thread trying to grab the spinlock,
>> which was never released by the BUG()ed and destroyed kernel thread.
>>
>>> The secondary was available on the network, but it was sluggish to work with.
>>>
>>> When I tried to reboot the next morning, the reboot stalled; it couldn't
>>> deconfigure the network, so I had to reboot using the power button...
>>
>> Yes, that would now be pretty much expected.
>> I'd probably have done a
>>   for x in s u b ; do echo $x > /proc/sysrq-trigger ; sleep 1; done
>> I'd even recommend setting these sysctls to have any server reboot itself
>> if it triggers a BUG(): kernel.panic_on_oops = 1, kernel.panic = 30
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
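[Archive note: the two sysctls Lars recommends above can be made persistent
across reboots with a sysctl drop-in file. A sketch, with the values taken
from Lars' message; the filename is illustrative:]

```
# /etc/sysctl.d/90-panic-reboot.conf  (illustrative filename)
# Turn an oops/BUG() into an immediate panic instead of limping on
# with a dead thread possibly still holding a spinlock ...
kernel.panic_on_oops = 1
# ... and reboot automatically 30 seconds after a panic.
kernel.panic = 30
```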