Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Protocol A is defined like this: Protocol A. Asynchronous replication protocol. Local write operations on the primary node are considered completed as soon as the local disk write has finished, and the replication packet has been placed in the local TCP send buffer. - https://www.drbd.org/en/doc/users-guide-84/s-replication-protocols You probably want to look at DRBD proxy: https://www.drbd.org/en/doc/users-guide-84/s-drbd-proxy digimer On 27/06/16 12:18 PM, Louis Munro wrote: > Hello, > > I manage a pair of servers in two separate datacenters that replicate a > device (backing a MySQL database) between the two. > > Knowing that there was a storage issue in one of the datacenters that > meant that I/O was sometimes blocking for a few minutes, I set the > replication protocol to A. > That way (or so I believed), when the storage felt like acting up in our > secondary datacenter the worse that could happen was that replication > would lag, or even fail for a few minutes and then reconnect and catch up. > > Last Thursday we experienced a situation when the storage problem did > crop up, drbd connection between the two DCs was broken and then > replication restarted. > The problem I faced was that during that replication, I/O latency seems > to have been downgraded on both primary and secondary servers, just like > I would have expected (and had previously experienced) using protocol C. > Note that I/O was still slowed down in my secondary DC while the resync > was taking place. > > It’s as though, while replication is asynchronous, resynchronizations > are actually synchronous. > Am I missing something? > How can I avoid resyncs dragging down my I/O latency on the primary if > the secondary is dragging it’s feet? > > Some details and logs follow. > > > This is a pair of CentOS 6 servers, running drbd 8.4.5 on > a 2.6.32-573.18.1 kernel. > > > I am running with the following common configuration: > > common { > protocol A; > > handlers { > > } > > startup { > degr-wfc-timeout 120; > } > > syncer { > rate 500M; > al-extents 257; > } > options { > } > > disk { > on-io-error detach; > } > > net { > verify-alg md5; > } > } > > > This is what I have in the logs for the primary server: > > Jun 23 15:53:18 pf-db1 kernel: block drbd0: Remote failed to finish a > request within ko-count * timeout > Jun 23 15:53:18 pf-db1 kernel: block drbd0: peer( Secondary -> Unknown ) > conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: asender terminated > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Terminating drbd_a_mysql > Jun 23 15:53:18 pf-db1 kernel: block drbd0: new current UUID > C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Connection closed > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Timeout -> Unconnected ) > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver terminated > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Restarting receiver thread > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver (re)started > Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Unconnected -> > WFConnection ) > Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Handshake successful: Agreed > network protocol version 101 > Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Agreed to support TRIM on > protocol level > Jun 23 15:53:22 pf-db1 kernel: drbd mysql: conn( WFConnection -> > WFReportParams ) > Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Starting asender thread (from > drbd_r_mysql [20164]) > Jun 23 15:53:22 pf-db1 kernel: block drbd0: drbd_sync_handshake: > Jun 23 15:53:22 pf-db1 kernel: block drbd0: self > C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD > bits:1151 flags:0 > Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer > 5386F0E7BDD5DE26:0000000000000000:A561B9F4409AEACC:A560B9F4409AEACD > bits:0 flags:0 > Jun 23 15:53:22 pf-db1 kernel: block drbd0: uuid_compare()=1 by rule 70 > Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer( Unknown -> Secondary ) > conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) > Jun 23 15:53:22 pf-db1 kernel: block drbd0: send bitmap stats > [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0% > Jun 23 15:53:25 pf-db1 kernel: block drbd0: receive bitmap stats > [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0% > Jun 23 15:53:25 pf-db1 kernel: block drbd0: helper command: > /sbin/drbdadm before-resync-source minor-0 > Jun 23 15:53:26 pf-db1 kernel: block drbd0: helper command: > /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) > Jun 23 15:53:26 pf-db1 kernel: block drbd0: conn( WFBitMapS -> > SyncSource ) pdsk( Consistent -> Inconsistent ) > Jun 23 15:53:26 pf-db1 kernel: block drbd0: Began resync as SyncSource > (will sync 6952 KB [1738 bits set]). > Jun 23 15:53:26 pf-db1 kernel: block drbd0: updated sync UUID > C7BC9B26D8B411FB:5387F0E7BDD5DE27:5386F0E7BDD5DE27:A561B9F4409AEACD > Jun 23 15:53:38 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 6 > Jun 23 15:54:27 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 6 > Jun 23 15:54:33 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 5 > Jun 23 15:54:47 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 4 > Jun 23 15:55:03 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 6 > Jun 23 15:55:09 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 5 > Jun 23 15:55:15 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] > sock_sendmsg time expired, ko = 4 > Jun 23 15:58:37 pf-db1 kernel: block drbd0: Resync done (total 310 sec; > paused 0 sec; 20 K/sec) > Jun 23 15:58:37 pf-db1 kernel: block drbd0: updated UUIDs > C7BC9B26D8B411FB:0000000000000000:5387F0E7BDD5DE27:5386F0E7BDD5DE27 > Jun 23 15:58:37 pf-db1 kernel: block drbd0: conn( SyncSource -> > Connected ) pdsk( Inconsistent -> UpToDate ) > > > This is sar reporting on block devices while the problem was happening: > > # sar -f /var/log/sa/sa23 -dp -s 15:43:00 -e 16:09:00 | awk '$3 ~ /DEV/ > || $3 ~ /drbd0/' > 03:43:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz > await svctm %util > 03:44:01 PM drbd0 51.60 0.00 1601.61 31.04 0.04 > 0.83 0.19 0.99 > 03:45:01 PM drbd0 21.44 0.00 571.93 26.68 1.14 > 51.26 40.69 87.25 > 03:46:01 PM drbd0 17.72 0.00 406.49 22.94 1.20 > 65.60 53.99 95.65 > 03:47:01 PM drbd0 13.04 0.00 217.15 16.66 1.43 > 113.87 76.94 100.31 > 03:48:01 PM drbd0 5.16 0.00 85.15 16.49 1.30 > 244.81 191.23 98.73 > 03:49:02 PM drbd0 6.88 0.00 97.98 14.23 1.30 > 193.08 145.69 100.29 > 03:50:01 PM drbd0 10.53 0.00 180.82 17.18 1.51 > 120.57 95.35 100.38 > 03:51:01 PM drbd0 6.20 0.00 96.33 15.55 1.47 > 279.11 161.82 100.26 > 03:52:01 PM drbd0 9.71 0.00 183.82 18.92 1.36 > 118.30 103.47 100.51 > 03:53:01 PM drbd0 2.43 0.00 37.73 15.56 1.94 > 232.62 414.01 100.40 > 03:54:01 PM drbd0 12.55 0.00 398.39 31.75 8.01 > 736.01 69.56 87.28 > 03:55:01 PM drbd0 3.06 0.00 29.31 9.57 128.10 > 675.81 328.01 100.43 > 03:56:01 PM drbd0 1.34 0.00 39.36 29.40 43.34 > 126599.23 750.34 100.45 > 03:57:01 PM drbd0 5.36 0.00 128.97 24.05 1.60 > 258.42 186.98 100.27 > 03:58:01 PM drbd0 14.02 0.00 287.79 20.53 3.30 > 264.49 71.71 100.52 > 03:59:01 PM drbd0 23.30 0.00 654.19 28.08 2.32 > 103.39 42.16 98.23 > 04:00:01 PM drbd0 40.33 0.00 1390.08 34.47 3.56 > 88.24 19.73 79.58 > 04:01:01 PM drbd0 37.54 0.00 1225.94 32.66 3.62 > 96.50 22.01 82.64 > 04:02:01 PM drbd0 47.52 0.00 1602.41 33.72 2.41 > 50.75 8.77 41.68 > 04:03:01 PM drbd0 60.49 0.00 2000.13 33.07 1.72 > 28.36 3.88 23.45 > 04:04:01 PM drbd0 67.02 0.00 2297.37 34.28 1.01 > 15.12 1.40 9.39 > 04:05:01 PM drbd0 59.32 0.00 1996.30 33.65 0.89 > 15.09 1.82 10.77 > 04:06:01 PM drbd0 58.68 0.00 2023.81 34.49 1.55 > 26.38 3.58 20.99 > 04:07:01 PM drbd0 57.96 0.00 1989.85 34.33 0.98 > 16.98 1.71 9.92 > 04:08:01 PM drbd0 55.01 0.00 1872.40 34.03 0.99 > 17.99 2.14 11.80 > > > > Any help is greatly appreciated. > > Best regards, > -- > Louis Munro > lmunro at inverse.ca <mailto:lmunro at inverse.ca> :: www.inverse.ca > <http://www.inverse.ca> > +1.514.447.4918 x125 :: +1 (866) 353-6153 x125 > Inverse inc. :: Leaders behind SOGo (www.sogo.nu <http://www.sogo.nu>) > and PacketFence (www.packetfence.org <http://www.packetfence.org>) > > > > _______________________________________________ > drbd-user mailing list > drbd-user at lists.linbit.com > http://lists.linbit.com/mailman/listinfo/drbd-user > -- Digimer Papers and Projects: https://alteeve.ca/w/ What if the cure for cancer is trapped in the mind of a person without access to education?