Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello,
I manage a pair of servers in two separate datacenters that replicate a device (backing a MySQL database) between the two.
Knowing that there was a storage issue in one of the datacenters that meant that I/O was sometimes blocking for a few minutes, I set the replication protocol to A.
That way (or so I believed), when the storage felt like acting up in our secondary datacenter the worse that could happen was that replication would lag, or even fail for a few minutes and then reconnect and catch up.
Last Thursday we experienced a situation when the storage problem did crop up, drbd connection between the two DCs was broken and then replication restarted.
The problem I faced was that during that replication, I/O latency seems to have been downgraded on both primary and secondary servers, just like I would have expected (and had previously experienced) using protocol C.
Note that I/O was still slowed down in my secondary DC while the resync was taking place.
It’s as though, while replication is asynchronous, resynchronizations are actually synchronous.
Am I missing something?
How can I avoid resyncs dragging down my I/O latency on the primary if the secondary is dragging it’s feet?
Some details and logs follow.
This is a pair of CentOS 6 servers, running drbd 8.4.5 on a 2.6.32-573.18.1 kernel.
I am running with the following common configuration:
common {
protocol A;
handlers {
}
startup {
degr-wfc-timeout 120;
}
syncer {
rate 500M;
al-extents 257;
}
options {
}
disk {
on-io-error detach;
}
net {
verify-alg md5;
}
}
This is what I have in the logs for the primary server:
Jun 23 15:53:18 pf-db1 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout
Jun 23 15:53:18 pf-db1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: asender terminated
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Terminating drbd_a_mysql
Jun 23 15:53:18 pf-db1 kernel: block drbd0: new current UUID C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Connection closed
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Timeout -> Unconnected )
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver terminated
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Restarting receiver thread
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver (re)started
Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Unconnected -> WFConnection )
Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Handshake successful: Agreed network protocol version 101
Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Agreed to support TRIM on protocol level
Jun 23 15:53:22 pf-db1 kernel: drbd mysql: conn( WFConnection -> WFReportParams )
Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Starting asender thread (from drbd_r_mysql [20164])
Jun 23 15:53:22 pf-db1 kernel: block drbd0: drbd_sync_handshake:
Jun 23 15:53:22 pf-db1 kernel: block drbd0: self C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD bits:1151 flags:0
Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer 5386F0E7BDD5DE26:0000000000000000:A561B9F4409AEACC:A560B9F4409AEACD bits:0 flags:0
Jun 23 15:53:22 pf-db1 kernel: block drbd0: uuid_compare()=1 by rule 70
Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
Jun 23 15:53:22 pf-db1 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%
Jun 23 15:53:25 pf-db1 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%
Jun 23 15:53:25 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
Jun 23 15:53:26 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
Jun 23 15:53:26 pf-db1 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
Jun 23 15:53:26 pf-db1 kernel: block drbd0: Began resync as SyncSource (will sync 6952 KB [1738 bits set]).
Jun 23 15:53:26 pf-db1 kernel: block drbd0: updated sync UUID C7BC9B26D8B411FB:5387F0E7BDD5DE27:5386F0E7BDD5DE27:A561B9F4409AEACD
Jun 23 15:53:38 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6
Jun 23 15:54:27 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6
Jun 23 15:54:33 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5
Jun 23 15:54:47 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4
Jun 23 15:55:03 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6
Jun 23 15:55:09 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5
Jun 23 15:55:15 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4
Jun 23 15:58:37 pf-db1 kernel: block drbd0: Resync done (total 310 sec; paused 0 sec; 20 K/sec)
Jun 23 15:58:37 pf-db1 kernel: block drbd0: updated UUIDs C7BC9B26D8B411FB:0000000000000000:5387F0E7BDD5DE27:5386F0E7BDD5DE27
Jun 23 15:58:37 pf-db1 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
This is sar reporting on block devices while the problem was happening:
# sar -f /var/log/sa/sa23 -dp -s 15:43:00 -e 16:09:00 | awk '$3 ~ /DEV/ || $3 ~ /drbd0/'
03:43:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
03:44:01 PM drbd0 51.60 0.00 1601.61 31.04 0.04 0.83 0.19 0.99
03:45:01 PM drbd0 21.44 0.00 571.93 26.68 1.14 51.26 40.69 87.25
03:46:01 PM drbd0 17.72 0.00 406.49 22.94 1.20 65.60 53.99 95.65
03:47:01 PM drbd0 13.04 0.00 217.15 16.66 1.43 113.87 76.94 100.31
03:48:01 PM drbd0 5.16 0.00 85.15 16.49 1.30 244.81 191.23 98.73
03:49:02 PM drbd0 6.88 0.00 97.98 14.23 1.30 193.08 145.69 100.29
03:50:01 PM drbd0 10.53 0.00 180.82 17.18 1.51 120.57 95.35 100.38
03:51:01 PM drbd0 6.20 0.00 96.33 15.55 1.47 279.11 161.82 100.26
03:52:01 PM drbd0 9.71 0.00 183.82 18.92 1.36 118.30 103.47 100.51
03:53:01 PM drbd0 2.43 0.00 37.73 15.56 1.94 232.62 414.01 100.40
03:54:01 PM drbd0 12.55 0.00 398.39 31.75 8.01 736.01 69.56 87.28
03:55:01 PM drbd0 3.06 0.00 29.31 9.57 128.10 675.81 328.01 100.43
03:56:01 PM drbd0 1.34 0.00 39.36 29.40 43.34 126599.23 750.34 100.45
03:57:01 PM drbd0 5.36 0.00 128.97 24.05 1.60 258.42 186.98 100.27
03:58:01 PM drbd0 14.02 0.00 287.79 20.53 3.30 264.49 71.71 100.52
03:59:01 PM drbd0 23.30 0.00 654.19 28.08 2.32 103.39 42.16 98.23
04:00:01 PM drbd0 40.33 0.00 1390.08 34.47 3.56 88.24 19.73 79.58
04:01:01 PM drbd0 37.54 0.00 1225.94 32.66 3.62 96.50 22.01 82.64
04:02:01 PM drbd0 47.52 0.00 1602.41 33.72 2.41 50.75 8.77 41.68
04:03:01 PM drbd0 60.49 0.00 2000.13 33.07 1.72 28.36 3.88 23.45
04:04:01 PM drbd0 67.02 0.00 2297.37 34.28 1.01 15.12 1.40 9.39
04:05:01 PM drbd0 59.32 0.00 1996.30 33.65 0.89 15.09 1.82 10.77
04:06:01 PM drbd0 58.68 0.00 2023.81 34.49 1.55 26.38 3.58 20.99
04:07:01 PM drbd0 57.96 0.00 1989.85 34.33 0.98 16.98 1.71 9.92
04:08:01 PM drbd0 55.01 0.00 1872.40 34.03 0.99 17.99 2.14 11.80
Any help is greatly appreciated.
Best regards,
--
Louis Munro
lmunro at inverse.ca :: www.inverse.ca
+1.514.447.4918 x125 :: +1 (866) 353-6153 x125
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence (www.packetfence.org)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20160627/3d1237a4/attachment.htm>