[Drbd-dev] DRBD8: disconnecting while already disconnecting can hang the receiver

Montrose, Ernest Ernest.Montrose at stratus.com
Tue Nov 27 14:06:46 CET 2007


Phil,
I looked at my notes. To reproduce this, you can fake the condition as
follows:
* Issue a disconnect on node0 for r5.
* node1 then enters drbd_receiver.c:drbd_disconnect(). While it is still
in that function (put a small delay there or something to widen the
window), issue a "drbdsetup /dev/drbd5 disconnect" locally on node1.

This last drbdsetup will time out with "No response from the DRBD
driver! Is the module loaded?", but the driver will be waiting forever
in drbd_nl.c:drbd_nl_disconnect().
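
The idea from the proposal quoted further down (do not accept a new
disconnect request while a disconnect is already in progress) can be
sketched in plain user-space C. This is only an illustration, not the
actual patch: the enum is a simplified, hypothetical subset of the
connection states, its ordering is an assumption, and
request_disconnect() is a made-up helper, not a DRBD function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified connection states. The ordering (lower value
 * means "further torn down") is an assumption made for this sketch. */
enum conn_state {
    STANDALONE,
    DISCONNECTING,
    UNCONNECTED,
    TEARDOWN,
    WF_CONNECTION,
    CONNECTED
};

enum req_result {
    REQ_ACCEPTED,     /* transition to DISCONNECTING was started */
    REQ_ALREADY_DOWN  /* already at or below TEARDOWN, nothing to wait for */
};

/* Hypothetical helper: handle a "disconnect" request from user space.
 * If the connection is already being torn down, return immediately
 * instead of starting (and then waiting on) a second transition. */
static enum req_result request_disconnect(enum conn_state *cs)
{
    assert(cs != NULL);

    if (*cs <= TEARDOWN)
        return REQ_ALREADY_DOWN;

    *cs = DISCONNECTING;
    return REQ_ACCEPTED;
}
```

With a guard like this, the second "drbdsetup /dev/drbd5 disconnect" from
the steps above would get an immediate answer rather than waiting on a
transition that is already under way.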

Hope that helps. If not, I'll back out my fix and send you the exact
instrumentation to reproduce it.

EM--

-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner at linbit.com] 
Sent: Tuesday, November 27, 2007 5:36 AM
To: drbd-dev at linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: disconnecting while already disconnecting
can hang the receiver

On Monday 19 November 2007 00:11:36 Montrose, Ernest wrote:
> Hi all,
> There is a problem that manifests itself this way:
>
> Consider two nodes, A and B. A issues a disconnect for r2, and B enters
> drbd_receiver.c:drbd_disconnect(). While B is disconnecting, it gets a
> "disconnect" request for r2. This hangs the receiver.
>
> I am thinking that we should just not allow the state transition to
> "Disconnecting" if we are already disconnecting. We could redefine
> "StandAlone" to mean less than or equal to "TearDown" in some cases.
> I include a patch to show this.
>

Hi Ernest,

I tried hard to reproduce and understand this. I tried with various
kinds of instrumentation, but I cannot reproduce it.

I assumed that it "hangs" in the drbd_state_lock() function, but
I could not confirm that either by experiment or by drawing timing
diagrams.

Could you provide some logs of this event?

Thanks!

The best I can get:

Node1:
[42951592.560000] drbd0: state_locked
[42951592.560000] drbd0: state_unlocked
[42951592.560000] drbd0: peer( Secondary -> Unknown ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
[42951592.560000] drbd0: state_locked
[42951592.560000] drbd0: state_unlocked
[42951592.560000] drbd0: Writing meta data super block now.
[42951592.560000] drbd0: sock was shut down by peer
[42951592.560000] drbd0: short read expecting header on sock: r=0
[42951592.560000] drbd0: sock_recvmsg returned -104
[42951592.560000] drbd0: asender terminated
[42951592.560000] drbd0: tl_clear()
[42951592.560000] drbd0: Connection closed
[42951592.560000] drbd0: conn( Disconnecting -> StandAlone )
[42951592.560000] drbd0: receiver terminated

Node2:
[42951603.570000] drbd0: state_locked
[42951603.570000] drbd0: peer( Secondary -> Unknown ) conn( Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
[42951603.570000] drbd0: Writing meta data super block now.
[42951603.570000] drbd0: state_unlocked
[42951603.570000] drbd0: conn( TearDown -> Disconnecting )
[42951603.570000] drbd0: asender terminated
[42951603.570000] drbd0: tl_clear()
[42951603.570000] drbd0: Connection closed
[42951603.570000] drbd0: conn( Disconnecting -> StandAlone )
[42951603.570000] drbd0: receiver terminated

Of course, the state transition TearDown -> Disconnecting is not right,
but I cannot reproduce a hang of the receiver.

-phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :

