On Tue, Oct 14, 2014 at 10:37:13PM +0200, Lars Ellenberg wrote:
> _drbd_thread_stop takes a "wait" parameter.
> It's a helper function for drbd_thread_stop()
> and for drbd_thread_stop_nowait().
> It is possible that "somewhere" we call drbd_thread_stop()
> where we should call drbd_thread_stop_nowait().
> But I don't see where that would be.
>
> Same helper function is called for drbd_thread_restart_nowait().
> I'd say that's what happens here.
>
> So from the state change triggered by the request_timer_fn
> in softirq context ("current" is "swapper" as seen from the other mail),
> we trigger a drbd_thread_restart_nowait().
>
> Which calls kthread_create -> drbd_thread_setup
>
> And *there* it bombs out.
>
> Yes, I see it happening now.
>
> I just wonder why it does not happen *always* then, *every* time that
> request timer expires and causes the peer to be kicked out?
>
> maybe something changed around kthread_create as well.
>
> Or maybe we just have been lucky for the last few years,
> and the "wake_up_process(); wait_for_completion()" in kthread_run
> just so happened to never need to call schedule,
> because it always took the fast path.
Hm. No.
Normally, when we call drbd_thread_restart_nowait(),
the respective thread is still alive, so the call just
"cycles" that thread and does not need to create a new one.
In this case, apparently the call raced with the thread terminating
for other reasons, so the thread was already gone
and we had to call into kthread_create, from softirq context.
Ok, so we know where it is broken.
We still have to think about how best to fix it.
Lars