Philipp Reisner philipp.reisner at linbit.com
Wed Jul 5 18:15:01 CEST 2006

> Apologies for the detail below, but I want to make sure I'm going about
> this the right way - Here's what I'm thinking as a way to fix this --
> please comment; you know this code so much better than I do!
> 1. Add a new field in the mdev - rs_failed - that counts the number of
>    NegDSReply's received, init to zero at start of resync


> 2. Move the code that checks for end of resync into a new routine -
>    drbd_check_for_end_resync() and change it to check if the bitmap
>    weight is <= rs_failed.


> 3. Change drbd_try_to_clean_on_disk_bm to schedule w_update_odbm if
>    _any_ bits are cleared on disk (perhaps it should be
>    some-bit-cleared AND (rs_failed!=0 || extent-now-completely-clear)
>    - that won't change the current behavior if no failures occur --
>    I'm just a bit worried about doing this too often...

I see the problem here... and I have a suggestion for you.
The bm_extent holds the number of dirty bits for the extent (rs_left).
Add a member there that holds the number of IO errors for that
sync extent (rs_failed).
... Do you see by now what I mean?
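
One way to read this, as a rough sketch only: the struct and the helper
names below are hypothetical, simplified stand-ins and not the actual
DRBD code. The assumption is that an extent counts as finished once every
bit that is still dirty is a known failure, i.e. rs_left <= rs_failed,
which would be the point to schedule the on-disk bitmap update.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for DRBD's struct bm_extent. */
struct bm_extent_sketch {
    int rs_left;   /* bits still dirty in this resync extent */
    int rs_failed; /* of those, bits whose resync read failed (new member) */
};

/* A block was resynced successfully: `count` bits become clean. */
static void extent_bits_synced(struct bm_extent_sketch *ext, int count)
{
    /* Only bits that are dirty and not already marked failed can be synced. */
    assert(count >= 0 && count <= ext->rs_left - ext->rs_failed);
    ext->rs_left -= count;
}

/* A NegDSReply arrived: the bits stay dirty but are accounted as failed. */
static void extent_bits_failed(struct bm_extent_sketch *ext, int count)
{
    /* Failed bits are a subset of the bits that are still dirty. */
    assert(count >= 0 && ext->rs_failed + count <= ext->rs_left);
    ext->rs_failed += count;
}

/* Nothing left to wait for: every remaining dirty bit is a known failure. */
static bool extent_resync_done(const struct bm_extent_sketch *ext)
{
    return ext->rs_left <= ext->rs_failed;
}
```

With no IO errors rs_failed stays zero and the check reduces to
rs_left == 0, so the existing behavior would be unchanged in that case.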

> 4. Add a call to drbd_check_for_end_resync() in got_NegDSReply() to
> handle the case where the last block failed.


> 5. Find all the places where rs_total, rs_mark_left and the bitmap
>    weight are referenced and include rs_failed as necessary (e.g.
>    BM_PARANOIA_CHECK in drbd_bitmap.c).

