Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, Here is a preliminary patch to fix drbd-0.7.12 . But I am not yet happy with that patch, because I do not really understand where the bug is. To Andre and Jeff, Could you please verify that the patch fixes the problem ? For people who think that they know something about kernel programming: Problem description the function _drbd_process_ee() [ in drbd_receiver.c ] is called with the mdev->ee_lock held. Thus there are never two parallel threads of execution entering that function at the same time. But that function has to release the spinlock in oder to call the callbacks on those 'epoch_entries'. So a second CPU could enter that function if the first is just processing a callback. To avoid this there is this PROCESS_EE_RUNNING bit, and the ee_wait wait queue. An execution context has to grab that bit first, before it procedes with the body of the function, otherwise it sleeps on ee_wait, until the other CPU releases the bit and calls wake_up(ee_wait). Here is the code: STATIC int _drbd_process_ee(drbd_dev *mdev, int be_sleepy) { struct Tl_epoch_entry *e; struct list_head *head = &mdev->done_ee; struct list_head *le; int ok=1; int got_sig; MUST_HOLD(&mdev->ee_lock); reclaim_net_ee(mdev); if( test_and_set_bit(PROCESS_EE_RUNNING,&mdev->flags) ) { if(!be_sleepy) { return 3; } spin_unlock_irq(&mdev->ee_lock); got_sig = wait_event_interruptible(mdev->ee_wait, test_and_set_bit(PROCESS_EE_RUNNING,&mdev->flags) == 0); spin_lock_irq(&mdev->ee_lock); if(got_sig) return 2; } while(!list_empty(head)) { le = head->next; list_del(le); spin_unlock_irq(&mdev->ee_lock); e = list_entry(le, struct Tl_epoch_entry, w.list); ok = ok && e->w.cb(mdev,&e->w,0); spin_lock_irq(&mdev->ee_lock); drbd_put_ee(mdev,e); } clear_bit(PROCESS_EE_RUNNING,&mdev->flags); wake_up(&mdev->ee_wait); return ok; } But for unknown reasons, after some time the PROCESS_EE_RUNNING bit is set but no CPU is in the function, therefore it never gets cleared again -> the receiver thereads blocks forever -> the whole resync process is stalled. Further facts: * This happens only on SMP machines (up to now tested on x86 only) * As soon as the PROCESS_EE_RUNNING bit is moved to an other long word it works !?!?! Changes that do _not_ fix the problem, but I have already tried: * smp_mb__after_clear_bit() after the clear_bit() * wake_up_all() instead of wake_up() What are your conclusions from this ? -Philipp -- : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com : -------------- next part -------------- A non-text attachment was scrubbed... Name: drbd-0.7.12-syncer-fix.diff Type: text/x-diff Size: 1334 bytes Desc: not available URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20050831/fdc9a21c/attachment.diff>