[DRBD-user] Warning: If using drbd; data-loss / corruption is possible; PLEASE FIX IT !

"Thomas Brücker" public at thomas-r-bruecker.ch
Mon Aug 14 22:09:06 CEST 2017

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Dear DRBD-Developers, dear DRBD-Users,

Actually I would be very fond of DRBD -- but unfortunately I have sometimes
had data losses (rarely, but I had them).

FOR DEVELOPERS AND USERS:

DRBD versions concerned: 9.0.7-1, 9.0.8-1, 9.0.9rc1-1 ("THE VERSIONS").

I think the following configuration options are required to trigger these
data losses:
net {
    congestion-fill  "1";    # 1 sector
    on-congestion    "pull-ahead";
    protocol         "A";
    [... (other options)]
}
(The goal of these settings: a very slow network connection should not
 slow down local disk I/O. With "on-congestion pull-ahead" the primary
 stops replicating writes as soon as the amount of in-flight data exceeds
 "congestion-fill" and instead marks the affected blocks as out-of-sync
 in the bitmap, to be resynchronized later.)

The data losses showed up as follows:
* There is a local DRBD device and its (one and only) peer.
* The local DRBD device is set to "primary", the peer device to "secondary".
* The local DRBD device is mounted as an ext3 filesystem.
* Some files are written to the mounted DRBD device (in total e.g. ca. 50GB).
* After the peer device has been synchronized with the local device, some
  sectors turn out not to have been transferred to the peer device (e.g.
  ca. 4MB). This has been verified by md5summing the files before writing
  them to the DRBD device; then unmounting the local DRBD device, switching
  it to "secondary", switching the peer device to "primary", mounting the
  peer device as an ext3 filesystem and verifying the checksums of the
  files on it.
* Also essential for the following: I have set 'csums-alg  "md5"', so
  only sectors which differ between the local device and the peer device
  are transferred during resynchronization.
* Likewise, when after such a data loss I invalidate the peer device and
  force a full resynchronization of the peer device with the local device,
  the missing sectors show up in the status display as (ca.)
  "received: 4133". (So I know that about 4MB are missing. [Only the
  differing sectors are transferred, see above.])


MOSTLY FOR DEVELOPERS (AND INTERESTED USERS):

* Every time I have had such a data loss, I have seen the following message
  in the system log:
  "[...] SyncSource still sees bits set!! FIXME". (But not every time I
  have seen this message have I had a data loss.)
  (Source code, "drbd/drbd_receiver.c"; version 9.0.7-1: lines# 6258 ff.;
   version 9.0.8-1: lines# 6249 ff.:
   "
     /* TODO: Since DRBD9 we experience that SyncSource still has
        bits set... NEED TO UNDERSTAND AND FIX! */
     if (drbd_bm_total_weight(peer_device) > peer_device->rs_failed)
         drbd_warn(peer_device, "SyncSource still sees bits set!! FIXME\n");
   ")

* I strongly suspect that my data losses are related to those bits that
  remain set.
* In the following explanation I refer to the source code of
  version 9.0.7-1.

* Supposed Explanation:
  * Suppose, in line with the warning mentioned above, that some bits of
    the bitmap are set and stay set.
  * In "drbd/drbd_req.c::drbd_process_write_request()" at line# 1404,
    because of my 'mandatory settings', sooner or later
    "drbd_should_do_remote()" returns "false"; so "remote == false".
  * So, at line# 1422 the following branch is taken:
    "} else if (drbd_set_out_of_sync([...]))
       _req_mod(req, QUEUE_FOR_SEND_OOS, peer_device);
     }"
  * If the request "req" is now a write to sectors whose bits in the
    bitmap are already set, "drbd_set_out_of_sync()" returns (a count of)
    "0" (because bits that are already set are not counted by the
    functions called by "drbd_set_out_of_sync()").
  * So the branch "_req_mod(req, QUEUE_FOR_SEND_OOS, peer_device);" is not
    taken, and those sectors will never be transmitted to the peer? (A toy
    model of this is sketched below.)
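  * Toy model: the following is a small userspace sketch of my own (NOT
    DRBD code; all names and sizes are invented), showing the counting
    behaviour described above: a "set out of sync" function that returns
    only the number of NEWLY set bits, so a write that touches only
    already-set bits yields "0" and the "queue OOS" step is skipped.
    "
      /* toy_oos.c -- toy model, NOT DRBD code.
       * Illustrates the suspected mechanism: "set out of sync" returns
       * only the number of NEWLY set bits, so a write touching only
       * blocks whose bits are already set returns 0 and no out-of-sync
       * notification (QUEUE_FOR_SEND_OOS) is queued for it. */
      #include <stdio.h>
      #include <string.h>

      #define TOY_BITS 65536           /* one bit per 4 KiB block (toy size) */

      static unsigned char bitmap[TOY_BITS / 8];

      /* returns the number of bits that were not yet set; bits that are
       * already set are not counted */
      static unsigned long toy_set_out_of_sync(unsigned long start,
                                               unsigned long count)
      {
          unsigned long newly_set = 0, bit;

          for (bit = start; bit < start + count && bit < TOY_BITS; bit++) {
              if (!(bitmap[bit / 8] & (1u << (bit % 8)))) {
                  bitmap[bit / 8] |= (unsigned char)(1u << (bit % 8));
                  newly_set++;
              }
          }
          return newly_set;
      }

      /* models the "remote == false" branch of
       * drbd_process_write_request(): the out-of-sync notification is
       * only queued if at least one bit was newly set */
      static void toy_local_only_write(unsigned long start, unsigned long count)
      {
          if (toy_set_out_of_sync(start, count))
              printf("blocks %lu..%lu: bits newly set -> QUEUE_FOR_SEND_OOS\n",
                     start, start + count - 1);
          else
              printf("blocks %lu..%lu: 0 bits newly set -> nothing queued\n",
                     start, start + count - 1);
      }

      int main(void)
      {
          memset(bitmap, 0, sizeof(bitmap));

          toy_local_only_write(0, 1000); /* bits get set, OOS gets queued      */
          toy_local_only_write(0, 1000); /* same blocks: count 0, step skipped */
          return 0;
      }
    "
    (Compiled and run, the second call prints the "0 bits newly set" line;
     if DRBD behaves analogously when bits are left over from an earlier
     incident, those blocks are never announced to the peer.)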

* I also observed the following (for all "THE VERSIONS"):
  * I have a test resource with settings equivalent to the devices on
    which I have had data losses (local device: primary, peer device:
    secondary), but without a filesystem.
  * I write to the local device "/dev/drbd6" with
    "dd  if=/dev/urandom of=/dev/drbd6 bs=4096 count=1000".
  * When the writing has finished, I get (among others) the following
    status values for the local device:
    "received:0 sent:1564 out-of-sync:4000 pending:0 unacked: 0"
  * The "out-of-sync" count remains (the local and the peer device are
    never synchronized).
  * After a "drbdadm down [...]" / "drbdadm up [...]" sequence the
    "out-of-sync" count is still "4000". (Why are two corresponding devices
    that are out of sync not synchronized automatically after restarting?)
  * When I write out the metadata of the local device, I get the following:
    "
     [...]
     bitmap[0] {
     # at 0kB
      12 times 0xFFFFFFFFFFFFFFFF;
      0xFFFFFFFFFFFFFFFF; 0xFFFFFFFFFFFFFFFF; 0xFFFFFFFFFFFFFFFF;
      0x000000FFFFFFFFFF;
      65520 times 0x0000000000000000;
     }
     # bits-set 1000;
    "; so 'the "4000" out of sync' are persistent.
  * When I write again to the local device "/dev/drbd6" (with the "4000"
    KiB still marked out of sync) with
    "dd if=/dev/urandom of=/dev/drbd6 bs=4096 count=1000", I get e.g.
    (among others) the following status values for the local device:
    "received:0 sent:1432 out-of-sync:4000 pending:0 unacked: 0", and then
    * only the first "1432" KiB have been transferred to the peer
      device, and
    * the requests "req" of the sectors that were not transferred are
      exactly the requests for which (at the corresponding branches)
      "drbd_set_out_of_sync()" returns (a count of) "0".

I am longing for a perfectly working DRBD,

Sincerely

Thomas Bruecker

---------------------------------------------------------------------------
Appendix:
* Procedure used to arrive at the statements above:
  For every request "req" passing through DRBD I have logged the start
  sector# of the "bio" in the request. I have tagged the start sector#
  according to the location / function where the request / "bio" was
  logged; e.g. a request for which "drbd_set_out_of_sync()" returned
  (a count of) "0" produced output such as "dcs2:72" (sector# 72). (A
  small post-processing sketch follows below.)

* Configuration Files:

  * Devices with data-loss: "
    resource  EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.RESOURCE {

      net {
        congestion-fill  "1";
        csums-alg        "md5";
        on-congestion    "pull-ahead";
        protocol         "A";
        transport        "tcp";
      }

      on ico {
        address  ipv4 10.235.1.88:7789;
        node-id  1;
        volume 0 {
          device  "/dev/drbd0";
          disk  "/media/byUuid/
                          EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.N1.V0.BASE";
          meta-disk  "internal";
        }
      }

      on xxxxx {
        address  ipv4 192.168.250.6:7789;
        node-id  0;
        volume 0 {
          device  "/dev/drbd0";
          disk  "/media/byUuid/
                          EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.N0.V0.BASE";
          meta-disk  "internal";
        }
      }
    }
  "

  * Test-resource: "
    resource  EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.TEST3.RESOURCE {

      net {
        congestion-fill  "1";
        csums-alg        "md5";
        on-congestion    "pull-ahead";
        protocol         "A";
        transport        "tcp";
      }

      on xxxxx {
        address  ipv4 192.168.250.6:7792;
        node-id  0;
        volume 0 {
          device  "/dev/drbd6";
          disk    "/media/byUuid/
                    EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.TEST3.N0.V0.BASE";
          meta-disk  "internal";
        }
      }

      on build-centOS-6-x.int.thomas-r-bruecker.ch {
        address  ipv4 10.235.1.42:7792;
        node-id  1;
        volume 0 {
          device  "/dev/drbd6";
          disk  "/media/byUuid/
                    EF1C0E32-3CB0-11DB-B6E3-0000C00A45A9.TEST3.N1.V0.BASE";
          meta-disk  "internal";
        }
      }
    }
  "




