Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, I manage a pair of servers in two separate datacenters that replicate a device (backing a MySQL database) between the two. Knowing that there was a storage issue in one of the datacenters that meant that I/O was sometimes blocking for a few minutes, I set the replication protocol to A. That way (or so I believed), when the storage felt like acting up in our secondary datacenter the worse that could happen was that replication would lag, or even fail for a few minutes and then reconnect and catch up. Last Thursday we experienced a situation when the storage problem did crop up, drbd connection between the two DCs was broken and then replication restarted. The problem I faced was that during that replication, I/O latency seems to have been downgraded on both primary and secondary servers, just like I would have expected (and had previously experienced) using protocol C. Note that I/O was still slowed down in my secondary DC while the resync was taking place. It’s as though, while replication is asynchronous, resynchronizations are actually synchronous. Am I missing something? How can I avoid resyncs dragging down my I/O latency on the primary if the secondary is dragging it’s feet? Some details and logs follow. This is a pair of CentOS 6 servers, running drbd 8.4.5 on a 2.6.32-573.18.1 kernel. I am running with the following common configuration: common { protocol A; handlers { } startup { degr-wfc-timeout 120; } syncer { rate 500M; al-extents 257; } options { } disk { on-io-error detach; } net { verify-alg md5; } } This is what I have in the logs for the primary server: Jun 23 15:53:18 pf-db1 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout Jun 23 15:53:18 pf-db1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) Jun 23 15:53:18 pf-db1 kernel: drbd mysql: asender terminated Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Terminating drbd_a_mysql Jun 23 15:53:18 pf-db1 kernel: block drbd0: new current UUID C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Connection closed Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Timeout -> Unconnected ) Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver terminated Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Restarting receiver thread Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver (re)started Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Unconnected -> WFConnection ) Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Handshake successful: Agreed network protocol version 101 Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Agreed to support TRIM on protocol level Jun 23 15:53:22 pf-db1 kernel: drbd mysql: conn( WFConnection -> WFReportParams ) Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Starting asender thread (from drbd_r_mysql [20164]) Jun 23 15:53:22 pf-db1 kernel: block drbd0: drbd_sync_handshake: Jun 23 15:53:22 pf-db1 kernel: block drbd0: self C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD bits:1151 flags:0 Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer 5386F0E7BDD5DE26:0000000000000000:A561B9F4409AEACC:A560B9F4409AEACD bits:0 flags:0 Jun 23 15:53:22 pf-db1 kernel: block drbd0: uuid_compare()=1 by rule 70 Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent ) Jun 23 15:53:22 pf-db1 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0% Jun 23 15:53:25 pf-db1 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0% Jun 23 15:53:25 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 Jun 23 15:53:26 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) Jun 23 15:53:26 pf-db1 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) Jun 23 15:53:26 pf-db1 kernel: block drbd0: Began resync as SyncSource (will sync 6952 KB [1738 bits set]). Jun 23 15:53:26 pf-db1 kernel: block drbd0: updated sync UUID C7BC9B26D8B411FB:5387F0E7BDD5DE27:5386F0E7BDD5DE27:A561B9F4409AEACD Jun 23 15:53:38 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6 Jun 23 15:54:27 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6 Jun 23 15:54:33 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5 Jun 23 15:54:47 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4 Jun 23 15:55:03 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6 Jun 23 15:55:09 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5 Jun 23 15:55:15 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4 Jun 23 15:58:37 pf-db1 kernel: block drbd0: Resync done (total 310 sec; paused 0 sec; 20 K/sec) Jun 23 15:58:37 pf-db1 kernel: block drbd0: updated UUIDs C7BC9B26D8B411FB:0000000000000000:5387F0E7BDD5DE27:5386F0E7BDD5DE27 Jun 23 15:58:37 pf-db1 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) This is sar reporting on block devices while the problem was happening: # sar -f /var/log/sa/sa23 -dp -s 15:43:00 -e 16:09:00 | awk '$3 ~ /DEV/ || $3 ~ /drbd0/' 03:43:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util 03:44:01 PM drbd0 51.60 0.00 1601.61 31.04 0.04 0.83 0.19 0.99 03:45:01 PM drbd0 21.44 0.00 571.93 26.68 1.14 51.26 40.69 87.25 03:46:01 PM drbd0 17.72 0.00 406.49 22.94 1.20 65.60 53.99 95.65 03:47:01 PM drbd0 13.04 0.00 217.15 16.66 1.43 113.87 76.94 100.31 03:48:01 PM drbd0 5.16 0.00 85.15 16.49 1.30 244.81 191.23 98.73 03:49:02 PM drbd0 6.88 0.00 97.98 14.23 1.30 193.08 145.69 100.29 03:50:01 PM drbd0 10.53 0.00 180.82 17.18 1.51 120.57 95.35 100.38 03:51:01 PM drbd0 6.20 0.00 96.33 15.55 1.47 279.11 161.82 100.26 03:52:01 PM drbd0 9.71 0.00 183.82 18.92 1.36 118.30 103.47 100.51 03:53:01 PM drbd0 2.43 0.00 37.73 15.56 1.94 232.62 414.01 100.40 03:54:01 PM drbd0 12.55 0.00 398.39 31.75 8.01 736.01 69.56 87.28 03:55:01 PM drbd0 3.06 0.00 29.31 9.57 128.10 675.81 328.01 100.43 03:56:01 PM drbd0 1.34 0.00 39.36 29.40 43.34 126599.23 750.34 100.45 03:57:01 PM drbd0 5.36 0.00 128.97 24.05 1.60 258.42 186.98 100.27 03:58:01 PM drbd0 14.02 0.00 287.79 20.53 3.30 264.49 71.71 100.52 03:59:01 PM drbd0 23.30 0.00 654.19 28.08 2.32 103.39 42.16 98.23 04:00:01 PM drbd0 40.33 0.00 1390.08 34.47 3.56 88.24 19.73 79.58 04:01:01 PM drbd0 37.54 0.00 1225.94 32.66 3.62 96.50 22.01 82.64 04:02:01 PM drbd0 47.52 0.00 1602.41 33.72 2.41 50.75 8.77 41.68 04:03:01 PM drbd0 60.49 0.00 2000.13 33.07 1.72 28.36 3.88 23.45 04:04:01 PM drbd0 67.02 0.00 2297.37 34.28 1.01 15.12 1.40 9.39 04:05:01 PM drbd0 59.32 0.00 1996.30 33.65 0.89 15.09 1.82 10.77 04:06:01 PM drbd0 58.68 0.00 2023.81 34.49 1.55 26.38 3.58 20.99 04:07:01 PM drbd0 57.96 0.00 1989.85 34.33 0.98 16.98 1.71 9.92 04:08:01 PM drbd0 55.01 0.00 1872.40 34.03 0.99 17.99 2.14 11.80 Any help is greatly appreciated. Best regards, -- Louis Munro lmunro at inverse.ca :: www.inverse.ca +1.514.447.4918 x125 :: +1 (866) 353-6153 x125 Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org) -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20160627/3d1237a4/attachment.htm>