Ok, here is the real fix!!! The patch is against plain drbd-0.7.12.

This was really hard.

In the kernel's API there are two variants of all bitops: the atomic
ones, set_bit(), clear_bit(), test_bit() etc., and the non-atomic ones,
__set_bit(), __clear_bit() ...

The race condition:

CPU1 was in an IO completion handler and used __set_bit(SYNC_STARTED,..)
there. Non-atomic means: first, it fetched the word from memory...

...CPU2 was exiting the _drbd_process_ee() function and cleared the bit
with clear_bit(PROCESS_EE_RUNNING). Atomic means fetch, modify and write
happen as one indivisible step...

...back on CPU1 we now do the modify and write.

So CPU1 sets the PROCESS_EE_RUNNING bit again, because it fetched the
word before CPU2 did its atomic update and then wrote its stale copy
back.

So I conclude that the rule is: if you use the atomic bitops on a word,
you may never ever use the non-atomic bitops on the same word anywhere
in your code.

But it feels good to understand what was going on!

-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria    http://www.linbit.com :
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd-0.7.12-syncer-fix.diff
Type: text/x-diff
Size: 658 bytes
Desc: not available
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20050831/d93065ef/attachment.diff>
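
The interleaving described in the message can be replayed by hand in
userspace. What follows is only a sketch, not the DRBD code and not the
attached patch: the bit numbers, the function names replay_race() and
replay_fixed(), and the use of C11 <stdatomic.h> in place of the kernel
bitops are all assumptions made for illustration. Both "CPUs" are
simulated on a single thread so that the ordering is deterministic.

```c
#include <stdatomic.h>

/* Hypothetical bit numbers; the real DRBD values may differ. */
enum { SYNC_STARTED = 0, PROCESS_EE_RUNNING = 1 };

/*
 * Replays the lost update: a non-atomic set (like __set_bit) on CPU1
 * is split around an atomic clear (like clear_bit) on CPU2.
 * Returns the final value of the shared flags word.
 */
unsigned long replay_race(void)
{
    /* Shared word: PROCESS_EE_RUNNING is set, SYNC_STARTED is not. */
    unsigned long flags = 1UL << PROCESS_EE_RUNNING;

    /* CPU1, non-atomic set, step 1: fetch the word. */
    unsigned long cpu1_copy = flags;

    /* CPU2, atomic clear: fetch, modify and write as one step. */
    flags &= ~(1UL << PROCESS_EE_RUNNING);

    /* CPU1, steps 2 and 3: modify the stale copy and write it back. */
    cpu1_copy |= 1UL << SYNC_STARTED;
    flags = cpu1_copy;

    /* PROCESS_EE_RUNNING is set again: CPU2's clear was lost. */
    return flags;
}

/*
 * Same two operations, but both are atomic read-modify-write steps,
 * so neither can be split around the other.
 */
unsigned long replay_fixed(void)
{
    atomic_ulong flags;
    atomic_init(&flags, 1UL << PROCESS_EE_RUNNING);

    /* CPU1: atomic set (stand-in for set_bit). */
    atomic_fetch_or(&flags, 1UL << SYNC_STARTED);

    /* CPU2: atomic clear (stand-in for clear_bit). */
    atomic_fetch_and(&flags, ~(1UL << PROCESS_EE_RUNNING));

    return atomic_load(&flags);
}
```

In replay_race() the returned word still has PROCESS_EE_RUNNING set,
which is the symptom described in the message. In replay_fixed() only
SYNC_STARTED remains set, and that holds for either order of the two
atomic operations.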