[DRBD-user] Fwd: kernel BUG on new server with drbd 9.0.5

Jasmin J. jasmin at anw.at
Sun Nov 27 02:04:19 CET 2016

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello Lars!

> http://git.drbd.org/drbd-9.0.git/commitdiff/a67cbb3429858b2e5faeb78dfc96820b516b2952
I just reviewed your commit, and I have a suggestion: omit the "have_mutex"
variable and the two "if (have_mutex)" checks, similar to the
"disconnect_rcu_unlock" label in the function "receive_protocol".

Simply add two labels "abort_unlock" and "retry_unlock" before the existing
"abort" and "retry" labels.

At both new labels add "mutex_unlock(&connection->mutex[DATA_STREAM]);" and
then fall through to the "abort" and "retry" code, respectively.

Then change the goto statements between the "mutex_lock" and "mutex_unlock"
calls to use the new labels; there are five such places.
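
Here is how I picture it. This is only a schematic sketch: "prepare",
"first_step" and "peer_wants_retry" are invented helpers and the whole
function body is made up, it is not the code from your commit:

    static int example(struct drbd_connection *connection)
    {
        int err;

        err = prepare(connection);           /* hypothetical */
        if (err)
            goto abort;                      /* lock not held yet: plain abort */

        mutex_lock(&connection->mutex[DATA_STREAM]);

        err = first_step(connection);        /* hypothetical */
        if (err)
            goto abort_unlock;               /* was: have_mutex check + goto abort */

        if (peer_wants_retry(connection))    /* hypothetical */
            goto retry_unlock;               /* was: have_mutex check + goto retry */

        mutex_unlock(&connection->mutex[DATA_STREAM]);
        return 0;

    abort_unlock:
        mutex_unlock(&connection->mutex[DATA_STREAM]);
        /* fall through to the existing abort code */
    abort:
        /* ... existing abort handling ... */
        return err;

    retry_unlock:
        mutex_unlock(&connection->mutex[DATA_STREAM]);
        /* fall through to the existing retry code */
    retry:                                   /* reached by "goto retry" elsewhere */
        /* ... existing retry handling ... */
        return -EAGAIN;
    }

That way every goto path unlocks exactly once, and the flag disappears.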

BR,
    Jasmin

****************************************************************************

On 11/27/2016 01:29 AM, Jasmin J. wrote:
> Hello Lars!
>
>> If you have a two node setup, stay with DRBD 8.4,
>> you don't gain anything from the new features of DRBD 9,
> I thought DRBD 9 is much faster than DRBD 8.4? At least I remember someone
> mentioning this in a post somewhere.
>
>> but, as you found out, still may be hit by the regressions.
> May I ask how stable you think DRBD 9 is compared to DRBD 8.4?
> As a developer you normally have a good feel for this.
>
> I really don't want to switch my cluster back to 8.4 if there isn't a good
> reason for it.
>
>> I may have worked around this particular bug already,
>> so it would no longer kill the box in the same situation,
>> but the underlying issue, why it even would run into this, is unclear.
> Was it this commit?
>
> http://git.drbd.org/drbd-9.0.git/commitdiff/a67cbb3429858b2e5faeb78dfc96820b516b2952
>
> I am a SW developer too, and I know the bugs I "implemented" in the past. So
> may I ask whether you checked the whole driver source for similar situations,
> where an error check jumps out of the normal control flow while still holding
> a spinlock or mutex?
> Whenever I found such a bug in my drivers, it rang a bell and I went checking
> the other candidates. Sometimes I found a hidden "feature" this way.
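>
> For illustration, the pattern I mean looks roughly like this (invented
> function and type, not code from any real driver):
>
>     static int fragile(struct state *s)
>     {
>         int err;
>
>         spin_lock_irq(&s->lock);
>         err = validate(s);        /* hypothetical check */
>         if (err)
>             goto out;             /* bug: jumps out with the spinlock held */
>         do_work(s);               /* hypothetical */
>         spin_unlock_irq(&s->lock);
>     out:
>         return err;
>     }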
>
> BR,
>    Jasmin
>
> ****************************************************************************
>
> On 11/25/2016 04:22 PM, Lars Ellenberg wrote:
>> On Thu, Nov 24, 2016 at 03:25:28PM +0100, Laszlo Fiat wrote:
>>>  Hello,
>>>
>>> I have had drbd 8.x in production for many years, without problems.
>>> I have a single primary setup, SAS disks are in hardware RAID5,
>>> partitions are assigned to drbd.
>>> On top of drbd, there are virtual machines, which use drbd resources
>>> as raw disks.
>>>
>>> I am migrating from an old pair of servers to a new pair of servers and
>>> to drbd 9.0.5. It went well until last midnight.
>>
>> If you have a two node setup, stay with DRBD 8.4,
>> you don't gain anything from the new features of DRBD 9,
>> but, as you found out, still may be hit by the regressions.
>>
>>> Resource R5 was doing initial synchronization between the two new
>>> servers, both running drbd 9.0.5.
>>> Resource R4 probably saw some read/write load on the primary node, as
>>> the nightly backup was running.
>>> The bug hit the server that was the target of the synchronisation, the
>>> drbd secondary.
>>>
>>> OS: Debian Jessie (stable) with Debian's stock kernel, and I compiled
>>> drbd 9.0.5 myself from the tarball.
>>>
>>> I'd like to know: is this a drbd bug, or a hardware issue?
>>
>> It likely is something in DRBD 9.
>>
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.104776]
>>     drbd r4/0 drbd4: ASSERTION test_bit(BME_NO_WRITES, &bm_ext->flags)
>>         FAILED in drbd_try_rs_begin_io
>>
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.107260] ------------[ cut
>>> here ]------------
>>> Nov 24 00:13:02 hufot-s-apl0004 kernel: [643314.109802] kernel BUG at
>>> /home/lanlaf/data/comp/drbd-9.0.5-1/drbd/lru_cache.c:571!
>>
>> I may have worked around this particular bug already,
>> so it would no longer kill the box in the same situation,
>> but the underlying issue, why it even would run into this, is unclear.
>>
>> Anyways, this is a kernel thread hitting a BUG() while holding a
>> spinlock, with irq disabled.
>>
>> Any further misbehavior is coming from there.
>>
>>> This task (drbd_a_r4) stayed with us until reboot, and kept one core at
>>> 100% CPU load:
>>
>> another thread trying to grab the spinlock,
>> which was never released by the BUG()ed and destroyed kernel thread.
>>
>>> The secondary was available on the network, but it was sluggish to work with.
>>>
>>> When I tried to reboot next morning, the reboot stalled, it couldn't
>>> deconfigure the network, so I had to reboot using the power button...
>>
>> Yes, that would now be pretty much expected.
>> I'd probably have done a
>>     for x in s u b ; do echo $x > /proc/sysrq-trigger ; sleep 1; done
>> I'd even recommend setting these sysctls to have any server reboot itself
>> if it triggers a BUG():  kernel.panic_on_oops = 1, kernel.panic = 30
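>>
>> A persistent form of those settings could look like this; the file name
>> under /etc/sysctl.d/ is only my assumption, the values are as above:
>>
>>     # /etc/sysctl.d/99-panic-reboot.conf
>>     kernel.panic_on_oops = 1    # turn any oops/BUG() into a panic
>>     kernel.panic = 30           # reboot 30 seconds after the panic
>>
>> Apply with "sysctl --system", or once at runtime with "sysctl -w".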
>>
>>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>


