[DRBD-user] Concurrent writes

Lars Ellenberg lars.ellenberg at linbit.com
Tue Apr 21 10:37:47 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Mon, Apr 20, 2009 at 07:05:31PM -0400, Gennadiy Nerubayev wrote:
> On Mon, Apr 20, 2009 at 6:28 AM, Lars Ellenberg
> <lars.ellenberg at linbit.com> wrote:
> 
> > On Mon, Apr 20, 2009 at 12:13:58PM +0200, Lars Ellenberg wrote:
> > >
> > > > 2. What's the worst case scenario (lost write? corrupt data? unknown
> > > > consistency?) that can result from concurrent writes?
> > >
> > > DRBD currently _drops_ the later write.
> > > if a write is detected as a concurrent local write,
> > > it is never submitted nor sent, just "completed",
> > > pretending that it had been successfully written.
> > >
> > > we considered _failing_ such writes with EIO,
> > > but decided to instead complain loudly and pretend success.
> >
> 
> So to clarify: in a typical scenario an initiator should not issue a
> write request for a block while an earlier write to the same (or
> overlapping) block has not yet been completed by DRBD, yet that is
> what is happening here?

yes.

possibly the target "announces" the equivalent of "tagged command
queueing" in iSCSI, the initiator tries to take advantage of that,
and either target or initiator implements it incorrectly.
not sure how to verify this assumption, maybe by using wireshark on the
iSCSI layer (which would also be a way to get at the actual data
of the overlapping requests).

> Is it at all possible that DRBD returns success earlier than it should
> have (obviously I'm using protocol C)?

No.
also, the DRBD protocol choice does not make a difference in this context.

simplified, the detection of these overlapping requests happens within
DRBD by a list walk. "pending" request objects get unlinked from these
lists before they are completed to upper layers, so only writes that are
still in flight can ever be flagged.
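
very roughly, in made-up C (a sketch of the idea only, not actual DRBD
source; the struct and function names here are invented for illustration):

/*
 * sketch only: each in-flight ("pending") write sits on a list with its
 * sector offset and size; a new write is checked against that list for
 * overlap.  entries are unlinked from the list before they are completed
 * to upper layers, so only genuinely concurrent writes can ever match.
 */
#include <stdbool.h>
#include <stddef.h>

struct pending_write {
        unsigned long long sector;      /* start, in 512 byte sectors */
        unsigned int size;              /* length in bytes */
        struct pending_write *next;     /* simplified singly linked list */
};

/* do [a, a+a_size) and [b, b+b_size) overlap? (sizes given in bytes) */
static bool overlaps(unsigned long long a_sector, unsigned int a_size,
                     unsigned long long b_sector, unsigned int b_size)
{
        unsigned long long a_end = a_sector + (a_size >> 9);
        unsigned long long b_end = b_sector + (b_size >> 9);

        return a_sector < b_end && b_sector < a_end;
}

/* return the pending write the new request collides with, or NULL */
static struct pending_write *
find_conflict(struct pending_write *pending_list,
              unsigned long long new_sector, unsigned int new_size)
{
        struct pending_write *p;

        for (p = pending_list; p; p = p->next)
                if (overlaps(new_sector, new_size, p->sector, p->size))
                        return p;
        return NULL;
}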


> > please post a few of the original log lines,
> > they should read something like
> > <comm>[pid] Concurrent local write detected!
> > [DISCARD L] new: <sector offset>s +<size in bytes>;
> >        pending: <sector offset>s +<size in bytes>
> >
> > I'm curious as to what the actual overlap is,
> > and in if there is any correlation between offsets.
> 
> 
> Here's an example for 8k random writes:
> 
> Apr 14 12:24:35 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 162328976s +8192; pending: 162328976s +8192
> Apr 14 12:24:38 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 161385248s +8192; pending: 161385248s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 157655888s +8192; pending: 157655888s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 165753872s +8192; pending: 165753872s +8192
> Apr 14 12:24:39 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 166654816s +8192; pending: 166654816s +8192
> Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 158260592s +8192; pending: 158260592s +8192
> Apr 14 12:24:40 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 163944704s +8192; pending: 163944704s +8192
> Apr 14 12:24:49 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 169511744s +8192; pending: 169511744s +8192
> Apr 14 12:24:51 srpt1 kernel: drbd0: scsi_tgt0[17271] Concurrent local write
> detected! [DISCARD L] new: 170614416s +8192; pending: 170614416s +8192
> Apr 14 12:24:52 srpt1 kernel: drbd0: scsi_tgt1[17272] Concurrent local write
> detected! [DISCARD L] new: 158642368s +8192; pending: 158642368s +8192

so new and pending requests are in fact the very same area.
interesting.

> 128k:
> 
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689092s +28672; pending: 562689144s +4096
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689172s +20480; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689148s +2048; pending: 562689144s +4096
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689212s +2048; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689152s +2048; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689216s +2048; pending: 562689216s +24576
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689156s +8192; pending: 562689152s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689220s +28672; pending: 562689264s +8192
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt0[21193] Concurrent local write
> detected! [DISCARD L] new: 562689292s +8192; pending: 562689280s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689276s +2048; pending: 562689264s +8192
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689280s +2048; pending: 562689280s +32768
> Apr 17 15:09:23 srpt1 kernel: drbd0: scsi_tgt1[21194] Concurrent local write
> detected! [DISCARD L] new: 562689284s +4096; pending: 562689280s +32768

these are overlapping, partially or completely.
some of the "new" offset/size tuples occur repeatedly.

also interesting.
I'm not sure, though, what we can make of that information.

> I know it shows two SCST threads, but it's the same thing even if I disable
> SCST threading (not to mention having the same thing happen with IET).
> 
> As I was about to send this email, I made another discovery: concurrent
> local writes do not happen when the DRBD device is disconnected.
>
> As soon as I reconnect, they reappear, and this is, as mentioned above,
> using protocol C.

sorry to disappoint you.
they are not checked for when disconnected ;(

data divergence due to conflicting (overlapping)
writes cannot happen while DRBD is not connected,
so in that case DRBD does not care.

the user is allowed to submit as much garbage to DRBD as they want.
DRBD "only" replicates whatever is submitted, and makes sure that
during normal operation, both replicas are bitwise identical.

That is the reason why DRBD complains loudly about conditions which make
this impossible in the general case, and enables "workarounds", so we
can hold up the "bitwise identical" guarantee, even if that means we
have to drop such conflicting writes.
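
again purely as a made-up sketch (reusing the invented find_conflict()
from above; fprintf stands in for kernel logging, and the submit/complete
callbacks are placeholders for the real block-layer hooks), the resulting
behaviour might be written like this:

#include <stdio.h>

enum conn_state { DISCONNECTED, CONNECTED };

/* hypothetical stand-ins for the real submission/completion paths */
void submit_locally_and_to_peer(unsigned long long sector, unsigned int size);
void complete_to_upper_layers(unsigned long long sector, unsigned int size,
                              int error);

void handle_new_write(enum conn_state state,
                      struct pending_write *pending_list,
                      unsigned long long sector, unsigned int size)
{
        struct pending_write *hit = NULL;

        /* conflicts are only checked for while connected:
         * when disconnected, the replicas cannot diverge anyway */
        if (state == CONNECTED)
                hit = find_conflict(pending_list, sector, size);

        if (hit) {
                /* complain loudly, then pretend success: the write is dropped */
                fprintf(stderr,
                        "Concurrent local write detected! [DISCARD L] "
                        "new: %llus +%u; pending: %llus +%u\n",
                        sector, size, hit->sector, hit->size);
                complete_to_upper_layers(sector, size, 0 /* "success" */);
                return;
        }

        submit_locally_and_to_peer(sector, size);
}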

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed


