[Drbd-dev] DRBD8: Receive_state() won't dec_local after a disk failure on peer.

Fri Jun 29 23:40:45 CEST 2007

Hi all,
We have been seeing a problem on a cluster of two systems, X and Y.
X is Primary and gets a disk fault.  X goes Diskless.
Y is now forced to become Primary.
X recovers from the fault.
Then Y gets a disk fault and goes Diskless, but stays Primary.
At this point, I/O to r0 hangs on Y!

A check of /proc/<pid>/wchan for the worker thread reveals that it is
waiting forever for local_cnt to become 0 in after_state_ch(). So the
worker thread will process the Net_read.  What happened is that, after
the first failure on X, receive_state() on Y failed to call dec_local():
the pdisk state it received was Diskless, so the dec_local() call was
skipped. The included patch illustrates the problem and attempts to fix it.
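
To make the leak concrete, here is a minimal stand-alone sketch of the
pattern described above. It is not the DRBD source and not the attached
patch: struct dev, enum disk_state, can_detach() and the two
receive_state_*() variants are hypothetical stand-ins, and the exact
condition in the real receive_state() may differ. It only assumes what
is stated above: a reference is taken on local_cnt, and it is released
only when the received pdisk is not Diskless.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration only, not DRBD's real types. */
enum disk_state { DISKLESS, UP_TO_DATE };

struct dev {
    int local_cnt;              /* references held on the local disk */
};

/* Take a reference on the local disk (always succeeds in this sketch). */
static bool inc_local(struct dev *d)
{
    d->local_cnt++;
    return true;
}

/* Drop a reference; dropping one that was never taken is a bug. */
static void dec_local(struct dev *d)
{
    assert(d->local_cnt > 0);
    d->local_cnt--;
}

/* Assumed buggy shape: the reference is released on one branch only. */
static void receive_state_buggy(struct dev *d, enum disk_state pdisk)
{
    if (!inc_local(d))
        return;
    if (pdisk != DISKLESS) {
        /* ... work that needs the local disk would go here ... */
        dec_local(d);
    }
    /* pdisk == DISKLESS: we return with the reference still held */
}

/* Assumed fixed shape: every successful inc_local() is paired. */
static void receive_state_fixed(struct dev *d, enum disk_state pdisk)
{
    if (!inc_local(d))
        return;
    if (pdisk != DISKLESS) {
        /* ... work that needs the local disk would go here ... */
    }
    dec_local(d);               /* released on every path */
}

/* Models the wait in after_state_ch(): it can finish only at zero. */
static bool can_detach(const struct dev *d)
{
    return d->local_cnt == 0;
}
```

In this model, calling receive_state_buggy() with DISKLESS leaves
local_cnt at 1, so can_detach() never becomes true, which matches the
worker thread waiting forever once Y itself later goes Diskless. With
receive_state_fixed() the count returns to 0 on both branches.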

Thanks.
EM--
-------------- next part --------------
A non-text attachment was scrubbed...
Name: drbd_recv.patch
Type: application/octet-stream
Size: 710 bytes
Desc: drbd_recv.patch
Url : http://lists.linbit.com/pipermail/drbd-dev/attachments/20070629/58498fa6/drbd_recv.obj