[DRBD-user] 4KiB backing stores -> virtual device sector size ?

Roberts, Keith Keith.Roberts at Teledyne.com
Fri Nov 20 15:10:12 CET 2020


Thank you both for your responses.  Sorry for my original confusion in saying that something had changed when that was not the case.

I understand Lars's comment that disabling the mixed connection completely may not be appropriate for a user who can tolerate the dynamic switch.

The reason for avoiding emulation of a 512 byte device within DRBD via read-modify-write is clear.  I don't know the virtual device layer well enough to understand why providing a config attribute that makes a resource present a fixed 4k virtual device sector size is problematic, but I take your word for it that this is not a good idea.

Without this ability, perhaps a (non-default) config file attribute that permits a mixed assembly would be helpful.  A user would then discover the issue when they first connect/attach the mix, rather than when the virtual drive is started without the 4Kn devices and drops to 512 bytes, as I observed.  With multiple mirrors this may not show up in minimal failover testing.

It seems like the combination of workstation 512e NVMe drives and remote 4Kn enterprise backing stores will become more common, and I won't be the last user to attempt this.

In my case QEMU was providing a translation layer above DRBD, which likely hid the issue.  I don't know whether QEMU supports the sector size changing dynamically below it, but I wasn't planning to rely on that.

Hi Keith,

I guess you are right about DRBD not considering backing devices with different physical_block_size. We will look into it...

best regards,
 Phil

On Wed, Nov 18, 2020 at 6:24 PM Roberts, Keith <Keith.Roberts at teledyne.com> wrote:
>
> Dear Philipp
>
>
>
> I don't want to compound my misinformation by double posting, and I am not sure how to reply to my own post (as I didn't receive an email to reply to) and have it show in the same thread. Sorry to bother you with a direct email, but I thought that might be less confusing.
>
>
>
> I stated in the attached posting that the old behavior was to maintain a 512 byte sector size independent of the 4KiB block backing.  In fact I have no evidence of that (I had misinterpreted some information).
>
>
>
> The example I gave is repeatable where the sector size spontaneously changes but I can't say this is a new behavior.
>
>
>
> In fact I was also able to demonstrate the sector size spontaneously changing once a remote mirror comes online and connects (while the drive was already primary on the local machine).  With a mix of NVMe (512 byte) client drives and 4KiB magnetic drives on the backing servers, this seems like a reasonable use case.  It seems like however DRBD is deciding to switch to 4KiB, it needs a manual override to force the larger sector size.
>
>
>
> My apologies for the initial misinformation in my original post and my incompetence with the mailing list.
>
>
>
> Regards
>
>
>
> Keith Roberts
>

-----Original Message-----
From: drbd-user-bounces at lists.linbit.com <drbd-user-bounces at lists.linbit.com> On Behalf Of Lars Ellenberg
Sent: Friday, November 20, 2020 5:57 AM
To: drbd-user at lists.linbit.com
Subject: Re: [DRBD-user] 4KiB backing stores -> virtual device sector size ?


On Wed, Nov 18, 2020 at 01:45:00AM +0000, Roberts, Keith wrote:
> Hi Community
> 
> I am currently using two production servers:
> kmod-drbd90-9.0.20-1.el7_7.elrepo.x86_64 on the 512 byte server (latest CentOS 7), and
> kmod-drbd90-9.0.22-2.el7_8.elrepo.x86_64 on the 4KiB server (a couple of months old CentOS 7).
>
> Something seems to have changed.  In the past, independent of the 4KiB
> sector size, the DRBD virtual device presented a hw_sector_size of 512
> bytes.  I am now seeing it spontaneously change to 4KiB when connected
> to a host with a 4KiB backing store enterprise drive.
>
> Step 1: create the DRBD resource on both systems.
> Step 2: on the 512 byte sector server, make it primary and check /sys/block/drbd4/queue/hw_sector_size; it is 512 bytes.
> Step 3: use the drive; all is fine with a 512 byte sector size.
> Step 4: make it secondary and make it primary on the other server.  Check /sys/block/drbd4/queue/hw_sector_size on the other server: 4096 bytes.
> Step 5: switch back to primary on the original server; now the sector size has switched to 4096 bytes and the drive is unusable, as the disk label (written for 512 byte sectors) is not handled correctly.
> 
> Where I saw this issue: I had GPT-labeled a drive in 512 byte mode, and
> it then became unreadable once the driver flipped it to a 4096 byte
> sector size.
> 
> I am very wary of changing anything (e.g. upgrading consistent
> versions) without understanding the mechanism as I don't want to 
> corrupt other volumes that are currently operating as 512 byte sectors 
> without a problem.
> 
> I do not see any calls to blk_queue_hardsect_size in the drbd driver 
> so I don't know what is changing the sector size to 4096 to reflect 
> the local backing store but I would really like to understand what 
> drives this as it does not appear to be backward compatible.
> 

Nothing changed on the DRBD side.

In some way, you could see DRBD as an advanced "dd".  Copying the full disk image from one IO backend (512 byte logical sector size) to another IO backend (4096 byte logical sector size) and then trying to access it from there will give you the exact same problem.
Or no problem at all, if the specific file system or other usage can tolerate that change.

In general, do not mix IO backends of different characteristics.

Specifically, for the mentioned issue, do not mix backends with different "logical_block_size".

Note that "hw_sector_size" is an alias to "logical_block_size", NOT physical_block_size.

As long as logical_block_size is the same, physical_block_size may be different, some translation layer below DRBD will do the work.
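
As a practical way to check this before assembling a resource, here is a minimal Python sketch that reads those sysfs attributes on a node; the device names are examples only, and you would run it on each node and compare the logical sizes between nodes yourself:

from pathlib import Path

def queue_attr(dev: str, attr: str) -> int:
    # e.g. /sys/block/nvme0n1/queue/logical_block_size
    return int(Path(f"/sys/block/{dev}/queue/{attr}").read_text())

def report(devices) -> None:
    for dev in devices:
        logical = queue_attr(dev, "logical_block_size")
        physical = queue_attr(dev, "physical_block_size")
        hw = queue_attr(dev, "hw_sector_size")  # alias of logical_block_size
        print(f"{dev}: logical={logical} physical={physical} hw_sector_size={hw}")
    if len({queue_attr(d, "logical_block_size") for d in devices}) > 1:
        print("WARNING: mismatched logical_block_size across these devices")

# Example backing device names; adjust to the devices that back your DRBD resource.
report(["nvme0n1", "sda"])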

For differing logical block sizes, here is what DRBD currently does
(and that did not change); a simplified sketch of this logic follows
right after the list.

DRBD communicates these block device queue topology settings during the
handshake, and:
- ignores the "physical" size
- checks the "logical" size; if the logical sizes do not match, we
  - complain: "logical block sizes do not match (me:%u, peer:%u); this may cause problems."
  - disable "write same" support, because any forwarded "write same"
    request with a mismatched logical size would trigger a BUG_ON in the
    SCSI layer
  - if we are currently Secondary (unused), we adjust the logical block
    size of our queue
  - if we are currently Primary (in use), we must not change the block
    size, so we don't
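
To make that list concrete, here is a small Python sketch that models the decision logic described above; it is only a toy illustration (the Device and handshake names are made up for this sketch), not the actual driver code:

from dataclasses import dataclass

@dataclass
class Device:
    role: str                  # "Primary" or "Secondary"
    logical_block_size: int    # what the DRBD device's queue currently advertises
    write_same_enabled: bool = True

def handshake(me: Device, my_backend_logical: int, peer_backend_logical: int) -> None:
    # The peers' "physical" block sizes are ignored entirely.
    if my_backend_logical == peer_backend_logical:
        return
    print(f"logical block sizes do not match (me:{my_backend_logical}, "
          f"peer:{peer_backend_logical}); this may cause problems.")
    # A forwarded "write same" with a mismatched logical size would trigger a
    # BUG_ON in the SCSI layer, so the feature is disabled.
    me.write_same_enabled = False
    if me.role == "Secondary":
        # Unused: the queue may be adjusted (to the maximum logical size).
        me.logical_block_size = max(my_backend_logical, peer_backend_logical)
    # Primary (in use): the block size must not change under the user, so it doesn't.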

What we will not do is "translate".

Yes, that means you end up with different logical block sizes on DRBD depending on the specific order of "attach", "connect" and "promote".  :-(

If you manage to always first connect all nodes,
*before* you start to use them, they should all agree on the maximum logical block size.
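
Continuing the toy model sketched above, the order dependence looks like this (again just an illustration of the rules described, with made-up helpers, not actual drbdadm behaviour):

# Order 1: connect while still Secondary, promote afterwards.
a = Device(role="Secondary", logical_block_size=512)
handshake(a, my_backend_logical=512, peer_backend_logical=4096)
a.role = "Primary"
print(a.logical_block_size)   # 4096: adopted the maximum before first use

# Order 2: promote first, connect the 4k peer later.
b = Device(role="Primary", logical_block_size=512)
handshake(b, my_backend_logical=512, peer_backend_logical=4096)
print(b.logical_block_size)   # 512: already in use, so the size was not changed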

Even though DRBD may now pretend to have a logical block size larger than that of its local backend, you may then get away with using it, as long as you only do regular reads and writes.  Or it may break in creative ways later.

What we could easily do is refuse to connect, if peers detect different logical block sizes on the backends.  Does not feel right, and would prevent a "rolling migration" to new storage systems.

What we will NOT do: implement a "translation" layer (read-modify-write when pretending to support 512 while the backend is 4k).
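
Purely to illustrate what such a translation would involve, here is a Python sketch of 512 byte sector emulation on a 4k backend via read-modify-write, done against a plain file standing in for the backend (the file name and helper are invented for the sketch; DRBD does not do any of this):

import os

BACKEND_BS = 4096   # block size of the backend
EMULATED_BS = 512   # sector size we would pretend to offer

def write_emulated_sector(fd: int, sector: int, data: bytes) -> None:
    assert len(data) == EMULATED_BS
    backend_block = (sector * EMULATED_BS) // BACKEND_BS
    offset = (sector * EMULATED_BS) % BACKEND_BS
    # Read the whole 4k block ...
    block = bytearray(os.pread(fd, BACKEND_BS, backend_block * BACKEND_BS))
    # ... modify the 512 byte slice ...
    block[offset:offset + EMULATED_BS] = data
    # ... and write the whole block back: an extra read plus a full-block write
    # for every small write, and a torn-write hazard without extra locking.
    os.pwrite(fd, bytes(block), backend_block * BACKEND_BS)

# "backend.img" is a stand-in file created just for this sketch.
fd = os.open("backend.img", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, 4 * BACKEND_BS)
write_emulated_sector(fd, sector=3, data=b"\xab" * EMULATED_BS)
os.close(fd)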

If you need that, implement that below DRBD.  Your device driver (or storage backend) may have an option to enable such a translation mode.
Maybe that has changed, and is no longer enabled?

What we could potentially do: "remember" (or explicitly set) a target block size in meta data, detect that during attach, and apply it to our queue parameters.

What that would give us is, again, a (potential) logical block size mismatch between DRBD and its respective local backends, which, as I already pointed out, will lead to problems when something more interesting than a regular read/write is to be processed.

So we decided to not do that.


If you have a file system or other use on top of DRBD that cannot tolerate a change in logical block size from one "mount" to the next, then make sure to use IO backends with identical (or similar enough) characteristics.

If you have a file system that can tolerate such a change, you may get away with formatting for 4k and using it from both 512 and 4k backends.

But keep in mind that DRBD does NOT do any translation, so again: anything more "interesting" than regular read/write may still lead to problems when DRBD relays that to the respective backend.

--
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker

DRBD(r) and LINBIT(r) are registered trademarks of LINBIT
__
please don't Cc me, but send to list -- I'm subscribed


