Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Sun, Feb 14, 2016 at 07:39:36PM +0900, 김재헌 wrote:
> Dear Philipp,
>
> Please check my previous question of CASE-14("[DRBD-user] [CASE-14] primary
> node hang by VM-net-disconnect during big file copy").
> According to this case, Linux drbd deadlock may occur.
> On the other hand, Windows side there is no deadlock but sometimes the
> transfer_log list is broken in _tl_restart function.
>
> So, We are trying to modify the source code as follows:
>
> 1. Modifications
>
> 1) in drbd_send_and_submit()
>
> if (likely(req->i.size != 0)) {
> if (rw == WRITE) {
> struct drbd_request *req2;
> resource->current_tle_writes++;
> #if 0 // WIN32 ### ignore tail_recursion ###
> list_for_each_entry_reverse(req2, &resource->transfer_log, tl_requests) {
> if (req2->rq_state[0] & RQ_WRITE) {
> /* Make the new write request depend on
> * the previous one. */
> kref_get(&req->kref);
> break;
> }
> }
> #endif
> }
>
> list_add_tail(&req->tl_requests, &resource->transfer_log);
> }
>
>
> 2) in drbd_req_destroy()
>
> if (s & RQ_WRITE && req_size) {
> list_for_each_entry(req, &device->resource->transfer_log, tl_requests) {
> if (req->rq_state[0] & RQ_WRITE) {
> /*
> * Do the equivalent of:
> * kref_put(&req->kref, drbd_req_destroy)
> * without recursing into the destructor.
> */
> #if 0 // WIN32 ### ignore tail_recursion ###
> if (atomic_dec_and_test(&req->kref.refcount))
> goto tail_recursion;
> #endif
> break;
> }
> }
> }
>
>
> 2. Questions
>
> 1) This part of "tail_recursion" is a new design on verson 9.
> Is this essential operation?
> I mean, what do you think about my ignoring tail_recursion part for
> temporary workaround?
I cannot explain all the implications within two lines of text, but we
want the "destruction" of drbd_requests to happen in the order they have
been put on the transfer log.
That is important in some multi-peer scenarios.
If there is no explicit dependency (here implemented via kref), they
could be destroyed out-of-order, and that could lead to bad decisions
elsewhere, or potentially not enough being resynced in some scenarios.
> 2) And what is the reason for the marking of "kref_get(&req->kref);" in
> drbd_send_and_submit and processing with recursion in drbd_req_destroy
> later?
see above.
> 3) On Windows side, we ignore this part(see source code of "#if 0 // WIN32
> ### ignore tail_recursion ###").
> Anyway, after ignore, Windows drbd engine works well, till now. Is
> there any problem?
see above.
> On Linux side, you cannot see this list-crash-case because the CASE-14 test
> may be done by deadlock first.
> Please check the CASE-14 deadlock case first and then check this CASE-20.
Cheers,
Lars