[DRBD-user] strange drbd bug

Wed Oct 15 14:16:29 CEST 2014

On Tue, Oct 14, 2014 at 10:37:13PM +0200, Lars Ellenberg wrote:
> _drbd_thread_stop takes a "wait" parameter.
> It's a helper function for drbd_thread_stop()
> and for drbd_thread_stop_nowait().
> It is possible that "somewhere" we call drbd_thread_stop()
> where we should call drbd_thread_stop_nowait().
> But I don't see where that would be.
> 
> Same helper function is called for drbd_thread_restart_nowait().
> I'd say that's what happens here.
> 
> So from the state change triggered by the request_timer_fn
> in softirq context ("current" is "swapper" as seen from the other mail),
> we trigger a drbd_thread_restart_nowait().
> 
> Which calls kthread_create -> drbd_thread_setup
> 
> And *there* it bombs out.
> 
> Yes, I see it happening now.
> 
> I just wonder why it does not happen *always* then, *every* time that
> request timer expires and causes the peer to be kicked out?
> 
> Maybe something changed around kthread_create as well.
> 
> Or maybe we just have been lucky for the last few years,
> and the "wake_up_process(); wait_for_completion()" in kthread_run
> just so happened to never need to call schedule,
> because it always took the fast path.

Hm. No.
Normally, when we call drbd_thread_restart_nowait,
the respective thread is still alive, so the call just
"cycles" that thread and does not need to create a new one.

In this case, apparently this call raced with the thread terminating
for other reasons, so the thread was already gone and we had to call
into kthread_create, which can sleep and therefore must not be called
from softirq context.
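
To make the two cases concrete, here is a minimal userspace sketch of
the decision. It is not the actual DRBD code: the state names, the
struct, and both helper functions are made up for illustration, and
the "create" branch only stands in for the kthread_create path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified thread states; not the real DRBD definitions. */
enum thread_state { T_NONE, T_RUNNING, T_EXITING, T_RESTARTING };

/* What a restart request ends up doing in this model. */
enum restart_action {
    ACTION_CYCLE_EXISTING, /* flag the live thread; nothing here sleeps */
    ACTION_CREATE_NEW      /* thread already gone; a new one is needed  */
};

struct model_thread {
    enum thread_state state;
};

/* Model of a "restart, do not wait" request. */
static enum restart_action model_restart_nowait(struct model_thread *thi)
{
    assert(thi != NULL);

    if (thi->state == T_NONE) {
        /* Race lost: the thread terminated for other reasons.
         * Stand-in for the path that ends in kthread_create. */
        thi->state = T_RUNNING;
        return ACTION_CREATE_NEW;
    }

    /* Usual case: the thread is alive, so just ask it to cycle. */
    thi->state = T_RESTARTING;
    return ACTION_CYCLE_EXISTING;
}

/* Creating a thread may sleep, so in this model the request is only
 * unsafe when it has to create a thread while in softirq context. */
static bool restart_is_safe(struct model_thread *thi, bool in_softirq)
{
    enum restart_action action = model_restart_nowait(thi);

    return !(in_softirq && action == ACTION_CREATE_NEW);
}
```

With a live thread, restart_is_safe() returns true even from softirq
context; only the combination "thread already gone" plus "softirq"
comes out as unsafe, which matches why this does not blow up every
time the request timer expires.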

OK, we know where it is broken.
We still have to think about how best to fix it.

  	Lars