[DRBD-user] 4Kib backing stores -> virtual device sector size ?

Fri Nov 20 11:57:28 CET 2020

On Wed, Nov 18, 2020 at 01:45:00AM +0000, Roberts, Keith wrote:
> Hi Community
> 
> I am currently using two production servers with
> kmod-drbd90-9.0.20-1.el7_7.elrepo.x86_64 on the 512 byte server + latest CentOS 7
> kmod-drbd90-9.0.22-2.el7_8.elrepo.x86_64 on the 4Kib server. (CentOS 7, a couple of months old)
>
> Something seems to have changed.  In the past independent of the 4Kib
> sector size the drbd virtual device presented a hw_sector_size of 512
> bytes.  I am now seeing it spontaneously change to 4KiB when connected
> to a host with 4KiB backing store enterprise drive.
>
> Step 1 create drbd resource on both systems.
> Step 2 on 512 byte sector server make it primary and check /sys/block/drbd4/queue/hw_sector_size it is 512 bytes
> Step 3 use drive all is fine with 512 byte physical sector size.
> Step 4 Make it secondary and make it primary on other server.  Check /sys/block/drbd4/queue/hw_sector_size on other server = 4096 bytes.
> Step 5 Switch back to primary on original server now sector size has switched to 4096 bytes and drive is unusable as the sector disk label is not handled correctly.
> 
> Where I saw this issue was I gpt labeled a drive in 512 byte mode and
> then it became unreadable once the driver flipped it to being a 4096
> byte sector.
> 
> I am very wary of changing anything (e.g. upgrading consistent
> versions) without understanding the mechanism as I don't want to
> corrupt other volumes that are currently operating as 512 byte sectors
> without a problem.
> 
> I do not see any calls to blk_queue_hardsect_size in the drbd driver
> so I don't know what is changing the sector size to 4096 to reflect
> the local backing store but I would really like to understand what
> drives this as it does not appear to be backward compatible.
> 

Nothing changed on the DRBD side.

In some way, you could see DRBD as an advanced "dd".  Copying the full
disk image from one IO backend (512 byte logical sector size) to
another IO backend (4096 byte logical sector size) and then trying to
access it from there will give you the exact same problem.
Or no problem at all, if the specific file system or other usage can
tolerate that change.
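
To illustrate why a GPT label written in 512 byte mode becomes unreadable
in 4096 byte mode: the primary GPT header lives at LBA 1, so its byte
offset is one logical block into the device.  A toy sketch (not a real
partition table parser; the function names are made up for this example,
and only the "EFI PART" signature is modeled):

```python
# Toy model: the primary GPT header sits at LBA 1, so where a reader
# looks for it depends on the logical block size it assumes.
GPT_SIGNATURE = b"EFI PART"


def make_image(label_block_size: int, image_size: int = 64 * 1024) -> bytearray:
    """Build a zero-filled image with a GPT signature written at LBA 1."""
    image = bytearray(image_size)
    offset = 1 * label_block_size  # LBA 1 in units of the labeling block size
    image[offset:offset + len(GPT_SIGNATURE)] = GPT_SIGNATURE
    return image


def gpt_header_visible(image: bytes, logical_block_size: int) -> bool:
    """Look for the signature where a reader using this block size expects it."""
    offset = 1 * logical_block_size
    return bytes(image[offset:offset + len(GPT_SIGNATURE)]) == GPT_SIGNATURE


if __name__ == "__main__":
    image = make_image(label_block_size=512)
    print("read as 512: ", gpt_header_visible(image, 512))
    print("read as 4096:", gpt_header_visible(image, 4096))
```

The data is byte-for-byte identical in both cases; only the place the
reader looks at has moved.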

In general, do not mix IO backends of different characteristics.

Specifically, for the mentioned issue, do not mix backends with different "logical_block_size".

Note that "hw_sector_size" is an alias to "logical_block_size", NOT physical_block_size.

As long as logical_block_size is the same, physical_block_size may be
different; some translation layer below DRBD will do the work.
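
To see which of these values a device actually reports, you can read the
queue attributes from sysfs.  A minimal sketch, assuming a Linux host;
the helper name queue_limits and the default device name "drbd4" (taken
from the original question) are just for illustration:

```python
from pathlib import Path

QUEUE_ATTRS = ("logical_block_size", "physical_block_size", "hw_sector_size")


def queue_limits(device: str, sysfs_root: str = "/sys/block") -> dict:
    """Return the block size attributes of a device queue (None if absent)."""
    queue = Path(sysfs_root) / device / "queue"
    limits = {}
    for name in QUEUE_ATTRS:
        attr = queue / name
        limits[name] = int(attr.read_text().strip()) if attr.exists() else None
    return limits


if __name__ == "__main__":
    import sys

    # Pass e.g. "drbd4" and then the name of its backing device to compare.
    dev = sys.argv[1] if len(sys.argv) > 1 else "drbd4"
    for name, value in queue_limits(dev).items():
        print(f"{dev}: {name} = {value}")
```

Run it for the DRBD device and for its backing device on every node;
hw_sector_size and logical_block_size should always show the same number.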

For differing logical block sizes,
what DRBD currently does (and that did not change):

DRBD communicates these block device queue topology settings during the handshake, and then:
- ignores the "physical" size
- checks the "logical" size
  if logical sizes do not match, we
    - complain: "logical block sizes do not match (me:%u, peer:%u); this may cause problems."
    - we disable "write same" support,
      because any forwarded "write same"
      request with mismatched logical size would trigger a BUG_ON in the
      scsi layer.
    - if we are currently Secondary (unused),
      we adjust the logical block size of our queue
    - if we are currently Primary (in use), we must not change the block
      size, so we don't

What we will not do is "translate".

Yes, that means you end up with different logical block sizes on DRBD
depending on the specific order of "attach", "connect" and "promote".
 :-(

If you manage to always first connect all nodes,
*before* you start to use them, they should all agree
on the maximum logical block size.
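
The rules above, and the order dependence they cause, can be modeled
roughly like this.  This is a toy model and not DRBD code: the Node class
and handshake function are invented for the example, and it assumes that
a Secondary adjusts to the larger of the two backend sizes and that the
check runs again on a later handshake.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    backend_lbs: int          # logical block size of the local backend
    primary: bool = False     # True while the device is in use
    queue_lbs: int = 0        # what the DRBD device presents
    write_same: bool = True

    def attach(self) -> None:
        # Assumption: after attach, the device mirrors its local backend.
        self.queue_lbs = self.backend_lbs


def handshake(a: Node, b: Node) -> None:
    """Apply the mismatch rules to both peers (toy model)."""
    if a.backend_lbs == b.backend_lbs:
        return
    target = max(a.backend_lbs, b.backend_lbs)  # assumption: agree on the max
    for node in (a, b):
        node.write_same = False       # never forward "write same" on mismatch
        if not node.primary:          # a Primary in use keeps its block size
            node.queue_lbs = target


if __name__ == "__main__":
    # Order 1: connect first, promote later.
    a, b = Node("a", 512), Node("b", 4096)
    a.attach(); b.attach(); handshake(a, b)
    print("connect first: ", a.queue_lbs, b.queue_lbs)

    # Order 2: promote the 512 node before it connects.
    a, b = Node("a", 512), Node("b", 4096)
    a.attach(); a.primary = True
    b.attach(); handshake(a, b)
    print("promote first: ", a.queue_lbs, b.queue_lbs)

    # Later: "a" is demoted and handshakes again while Secondary.
    a.primary = False
    handshake(a, b)
    print("after demotion:", a.queue_lbs, b.queue_lbs)
```

In this model, the second ordering leaves the two nodes presenting
different sizes until the 512 node handshakes again as Secondary, which
is the kind of flip described in the original question.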

Even though DRBD may now pretend to have a logical block size larger than
its local backend, you may then get away with using it, as long as you
only do regular read/write.  Or it may break in creative ways later.

What we could easily do is refuse to connect, if peers detect different
logical block sizes on the backends.  Does not feel right, and would
prevent a "rolling migration" to new storage systems.

What we will NOT do: implement a "translation" layer (read-modify-write
when pretending to support 512 while the backend is 4k).
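
For readers unfamiliar with the term, this is roughly what such a
read-modify-write translation has to do for every 512 byte write that
lands on a 4k backend.  A sketch only, with an in-memory stand-in for the
backend; the class and function names are hypothetical, and nothing like
this exists in DRBD:

```python
BACKEND_BLOCK = 4096    # smallest unit the backend can read or write
EMULATED_SECTOR = 512   # sector size presented to the upper layer


class Backend4k:
    """In-memory stand-in for a device that only does whole 4k blocks."""

    def __init__(self, blocks: int) -> None:
        self.data = bytearray(blocks * BACKEND_BLOCK)
        self.reads = 0
        self.writes = 0

    def read_block(self, n: int) -> bytes:
        self.reads += 1
        return bytes(self.data[n * BACKEND_BLOCK:(n + 1) * BACKEND_BLOCK])

    def write_block(self, n: int, buf: bytes) -> None:
        if len(buf) != BACKEND_BLOCK:
            raise ValueError("backend accepts whole 4k blocks only")
        self.writes += 1
        self.data[n * BACKEND_BLOCK:(n + 1) * BACKEND_BLOCK] = buf


def write_sector_512(backend: Backend4k, sector: int, payload: bytes) -> None:
    """Emulate one 512 byte sector write on top of the 4k backend."""
    if len(payload) != EMULATED_SECTOR:
        raise ValueError("payload must be exactly one 512 byte sector")
    block, index = divmod(sector * EMULATED_SECTOR, BACKEND_BLOCK)
    buf = bytearray(backend.read_block(block))        # read the whole block
    buf[index:index + EMULATED_SECTOR] = payload      # modify one sector
    backend.write_block(block, bytes(buf))            # write the block back


if __name__ == "__main__":
    dev = Backend4k(blocks=2)
    write_sector_512(dev, sector=9, payload=b"\xab" * EMULATED_SECTOR)
    print("backend reads:", dev.reads, "backend writes:", dev.writes)
```

Every small write costs an extra read, and the read and write together
are not atomic, which is part of why this belongs in a layer that is
designed for it.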

If you need that, implement that below DRBD.  Your device driver (or
storage backend) may have an option to enable such a translation mode.
Maybe that has changed, and is no longer enabled?

What we could potentially do: "remember" (or explicitly set) a target
block size in meta data, detect that during attach, and apply it to our
queue parameters.

What that would give us is again a (potential) logical block mismatch
between DRBD and its respective local backends.
Which, as I already pointed out, will lead to problems when something
more interesting than a regular read/write is to be processed.

So we decided to not do that.

If you have a file system or other use on top of DRBD that cannot
tolerate a change in logical block size from one "mount" to the next,
then make sure to use IO backends with identical (or similar enough)
characteristics.

If you have a file system that can tolerate such a change,
you may get away with formatting for 4k, and using it from both
512 and 4k backends.

But keep in mind that DRBD does NOT do any translation,
so again: anything more "interesting" than regular read/write may still
lead to problems when DRBD relays that to the respective backend.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD® and LINBIT® are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed