[Drbd-dev] DRBD8: disconnecting while already disconnecting can
hang the receiver
Montrose, Ernest
Ernest.Montrose at stratus.com
Tue Nov 27 22:51:21 CET 2007
Phil,
Your modification to the original patch will actually break it. The
reason is that we can get into "Disconnecting" from anywhere. Below are
some logs showing the problem.
On Node0:
# drbdsetup /dev/drbd16 disconnect
Nov 27 16:38:20 node1 kernel: drbd16: peer( Secondary -> Unknown ) conn(
Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
Nov 27 16:38:20 node1 kernel: drbd16: Creating new current UUID
Nov 27 16:38:20 node1 kernel: drbd16: short read expecting header on
sock: r=-512
Nov 27 16:38:20 node1 kernel: drbd16: asender terminated
Nov 27 16:38:20 node1 kernel: drbd16: tl_clear()
Nov 27 16:38:20 node1 kernel: drbd16: Connection closed
Nov 27 16:38:20 node1 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:20 node1 kernel: drbd16: conn( Disconnecting -> StandAlone
)
Nov 27 16:38:20 node1 kernel: drbd16: receiver terminated
Nov 27 16:38:23 node1 kernel: drbd16: conn( StandAlone -> Unconnected )
Nov 27 16:38:23 node1 kernel: drbd16: receiver (re)started
Nov 27 16:38:23 node1 kernel: drbd16: conn( Unconnected -> WFConnection
)
Nov 27 16:38:26 node1 kernel: drbd16: conn( WFConnection ->
WFReportParams )
Nov 27 16:38:26 node1 kernel: drbd16: Handshake successful: DRBD Network
Protocol version 86
Nov 27 16:38:26 node1 kernel: drbd16: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Nov 27 16:38:26 node1 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:26 node1 kernel: drbd16: conn( WFBitMapS -> SyncSource )
pdsk( UpToDate -> Inconsistent )
Nov 27 16:38:26 node1 kernel: drbd16: Began resync as SyncSource (will
sync 4 KB [1 bits set]).
Nov 27 16:38:26 node1 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:26 node1 kernel: drbd16: Resync done (total 1 sec; paused 0
sec; 4 K/sec)
Nov 27 16:38:26 node1 kernel: drbd16: conn( SyncSource -> Connected )
pdsk( Inconsistent -> UpToDate )
Nov 27 16:38:27 node1 kernel: drbd16: Writing meta data super block now.
======On Node1============
Nov 27 16:38:20 node0 kernel: drbd16: peer( Primary -> Unknown ) conn(
Connected -> TearDown ) pdsk( UpToDate -> DUnknown )
Nov 27 16:38:20 node0 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:20 node0 kernel: drbd16: meta connection shut down by peer.
Nov 27 16:38:20 node0 kernel: drbd16: asender terminated
Nov 27 16:38:20 node0 kernel: drbd16: tl_clear()
Nov 27 16:38:20 node0 kernel: drbd16: Connection closed
Nov 27 16:38:20 node0 kernel: drbd16: conn( TearDown -> Unconnected )
Nov 27 16:38:20 node0 kernel: drbd16: drbd_disconnect: ##5# EM-- Done
but waiting 30 seconds######
====Issue disconnect here=====
# drbdsetup /dev/drbd16 disconnect
No response from the DRBD driver! Is the module loaded?
Nov 27 16:38:26 node0 kernel: drbd16: conn( Unconnected -> Disconnecting
)
Nov 27 16:38:26 node0 kernel: drbd16: drbd_nl_disconnect: EM-- Start
wait_event_interruptible for mdev->state.conn==StandAlone ****
Nov 27 16:38:26 node0 kernel: drbd16: drbd_disconnect: ##5# EM-- Done
##### waiting 30 seconds######
Nov 27 16:38:26 node0 kernel: drbd16: receiver terminated
Nov 27 16:38:26 node0 kernel: drbd16: receiver (re)started
Nov 27 16:38:26 node0 kernel: drbd16: ASSERT( mdev->state.conn >=
Unconnected ) in
/sandbox/emontros/devel/trunk/platform/drbd/src/drbd/drbd_receiver.c:715
Nov 27 16:38:26 node0 kernel: drbd16: conn( Disconnecting ->
WFConnection )
Nov 27 16:38:26 node0 kernel: drbd16: conn( WFConnection ->
WFReportParams )
Nov 27 16:38:26 node0 kernel: drbd16: Handshake successful: DRBD Network
Protocol version 86
Nov 27 16:38:26 node0 kernel: drbd16: receive_state: EM-- ....
Nov 27 16:38:26 node0 kernel: drbd16: receive_state: EM-- ....calling
sync_handshake
Nov 27 16:38:26 node0 kernel: drbd16: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Nov 27 16:38:26 node0 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:26 node0 kernel: drbd16: conn( WFBitMapT -> WFSyncUUID )
Nov 27 16:38:26 node0 kernel: drbd16: conn( WFSyncUUID -> SyncTarget )
disk( UpToDate -> Inconsistent )
Nov 27 16:38:26 node0 kernel: drbd16: Began resync as SyncTarget (will
sync 4 KB [1 bits set]).
Nov 27 16:38:26 node0 kernel: drbd16: Writing meta data super block now.
Nov 27 16:38:27 node0 kernel: drbd16: Resync done (total 1 sec; paused 0
sec; 4 K/sec)
Nov 27 16:38:27 node0 kernel: drbd16: conn( SyncTarget -> Connected )
disk( Inconsistent -> UpToDate )
Nov 27 16:38:27 node0 kernel: drbd16: Writing meta data super block now.
===Done logging====
Notice that on node1 we never transition to StandAlone after the
disconnect. Because of that, drbd_nl_disconnect, which waits for
mdev->state.conn==StandAlone, waits forever.
-----Original Message-----
From: Philipp Reisner [mailto:philipp.reisner at linbit.com]
Sent: Tuesday, November 27, 2007 9:53 AM
To: drbd-dev at linbit.com
Cc: Montrose, Ernest
Subject: Re: [Drbd-dev] DRBD8: disconnecting while already disconnecting
can hang the receiver
On Tuesday 27 November 2007 14:06:46 Montrose, Ernest wrote:
> Phil,
> I looked at my notes... To reproduce this you can fake the condition
> this way:
> * Issue a disconnect on node0 for r5.
> * Locally on node1 we will get into drbd_receiver.c:drbd_disconnect()
> and while there in drbd_disconnect() (Put a small delay there or
> something); issue a "drbdsetup /dev/drbd5 disconnect".
>
> This last drbdsetup will time out with "No response from the DRBD
> driver! Is the module loaded?"
> But the driver will be waiting forever in
> drbd_nl.c:drbd_nl_disconnect().
>
Yes. This is what I tested. I had a delay in drbd_disconnect().
I did not manage to get it into trouble.
BTW, while looking at the patch, I would have done it like this:
@@ -589,7 +589,8 @@ STATIC int is_valid_state_transition(drbd_dev* mdev,drbd_state_t ns,drbd_state_t
     if( (ns.conn == StartingSyncT || ns.conn == StartingSyncS ) &&
         os.conn > Connected) rv=SS_ResyncRunning;
-    if( ns.conn == Disconnecting && os.conn == StandAlone)
+    if ( ns.conn == Disconnecting &&
+         ( os.conn == StandAlone || os.conn == TearDown ) )
         rv=SS_AlreadyStandAlone;
     if( ns.disk > Attaching && os.disk == Diskless)
-Phil
--
: Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :