<html><head><meta http-equiv="Content-Type" content="text/html charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;" class="">Hello,<div class=""><br class=""></div><div class="">I manage a pair of servers in two separate datacenters that replicate a device (backing a MySQL database) between the two.</div><div class=""><br class=""></div><div class="">Knowing that there was a storage issue in one of the datacenters that meant that I/O was sometimes blocking for a few minutes, I set the replication protocol to A.</div><div class="">That way (or so I believed), when the storage felt like acting up in our secondary datacenter the worse that could happen was that replication would lag, or even fail for a few minutes and then reconnect and catch up.</div><div class=""><br class=""></div><div class="">Last Thursday we experienced a situation when the storage problem did crop up, drbd connection between the two DCs was broken and then replication restarted.</div><div class="">The problem I faced was that during that replication, I/O latency seems to have been downgraded on both primary and secondary servers, just like I would have expected (and had previously experienced) using protocol C.</div><div class="">Note that I/O was still slowed down in my secondary DC while the resync was taking place.</div><div class=""><br class=""></div><div class="">It’s as though, while replication is asynchronous, resynchronizations are actually synchronous.</div><div class="">Am I missing something? </div><div class="">How can I avoid resyncs dragging down my I/O latency on the primary if the secondary is dragging it’s feet? </div><div class=""><br class=""></div><div class="">Some details and logs follow.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">This is a pair of CentOS 6 servers, running drbd 8.4.5 on a 2.6.32-573.18.1 kernel.</div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">I am running with the following common configuration: </div><div class=""><font face="Courier" class=""><br class=""></font></div><div class=""><font face="Courier" class="">common {<br class=""> protocol A;<br class=""><br class=""> handlers {<br class=""><br class=""> }<br class=""><br class=""> startup {<br class=""> degr-wfc-timeout 120;<br class=""> }<br class=""><br class=""> syncer {<br class=""> rate 500M;<br class=""> al-extents 257;<br class=""> }<br class=""> options {<br class=""> }<br class=""><br class=""> disk {<br class=""> on-io-error detach;<br class=""> }<br class=""><br class=""> net {</font></div><div class=""><span class="Apple-tab-span" style="white-space:pre">                </span>verify-alg md5;</div><div class=""><font face="Courier" class=""> }<br class="">}</font></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">This is what I have in the logs for the primary server: </div><div class=""><br class=""></div><div class=""><font face="Courier" class="">Jun 23 15:53:18 pf-db1 kernel: block drbd0: Remote failed to finish a request within ko-count * timeout<br class="">Jun 23 15:53:18 pf-db1 kernel: block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: asender terminated<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Terminating drbd_a_mysql<br class="">Jun 23 15:53:18 pf-db1 kernel: block drbd0: new current UUID C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Connection closed<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Timeout -> Unconnected )<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver terminated<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Restarting receiver thread<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver (re)started<br class="">Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Unconnected -> WFConnection )<br class="">Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Handshake successful: Agreed network protocol version 101<br class="">Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Agreed to support TRIM on protocol level<br class="">Jun 23 15:53:22 pf-db1 kernel: drbd mysql: conn( WFConnection -> WFReportParams )<br class="">Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Starting asender thread (from drbd_r_mysql [20164])<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: drbd_sync_handshake:<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: self C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD bits:1151 flags:0<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer 5386F0E7BDD5DE26:0000000000000000:A561B9F4409AEACC:A560B9F4409AEACD bits:0 flags:0<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: uuid_compare()=1 by rule 70<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )<br class="">Jun 23 15:53:22 pf-db1 kernel: block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%<br class="">Jun 23 15:53:25 pf-db1 kernel: block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%<br class="">Jun 23 15:53:25 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0<br class="">Jun 23 15:53:26 pf-db1 kernel: block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)<br class="">Jun 23 15:53:26 pf-db1 kernel: block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )<br class="">Jun 23 15:53:26 pf-db1 kernel: block drbd0: Began resync as SyncSource (will sync 6952 KB [1738 bits set]).<br class="">Jun 23 15:53:26 pf-db1 kernel: block drbd0: updated sync UUID C7BC9B26D8B411FB:5387F0E7BDD5DE27:5386F0E7BDD5DE27:A561B9F4409AEACD<br class="">Jun 23 15:53:38 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6<br class="">Jun 23 15:54:27 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6<br class="">Jun 23 15:54:33 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5<br class="">Jun 23 15:54:47 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4<br class="">Jun 23 15:55:03 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 6<br class="">Jun 23 15:55:09 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 5<br class="">Jun 23 15:55:15 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339] sock_sendmsg time expired, ko = 4<br class="">Jun 23 15:58:37 pf-db1 kernel: block drbd0: Resync done (total 310 sec; paused 0 sec; 20 K/sec)<br class="">Jun 23 15:58:37 pf-db1 kernel: block drbd0: updated UUIDs C7BC9B26D8B411FB:0000000000000000:5387F0E7BDD5DE27:5386F0E7BDD5DE27<br class="">Jun 23 15:58:37 pf-db1 kernel: block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )<br class=""><br class=""><br class="">This is sar reporting on block devices while the problem was happening: </font></div><div class=""><font face="Courier" class=""><br class=""></font></div><div class=""><font face="Courier" class=""># sar -f /var/log/sa/sa23 -dp -s 15:43:00 -e 16:09:00 | awk '$3 ~ /DEV/ || $3 ~ /drbd0/'<br class="">03:43:01 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util<br class="">03:44:01 PM drbd0 51.60 0.00 1601.61 31.04 0.04 0.83 0.19 0.99<br class="">03:45:01 PM drbd0 21.44 0.00 571.93 26.68 1.14 51.26 40.69 87.25<br class="">03:46:01 PM drbd0 17.72 0.00 406.49 22.94 1.20 65.60 53.99 95.65<br class="">03:47:01 PM drbd0 13.04 0.00 217.15 16.66 1.43 113.87 76.94 100.31<br class="">03:48:01 PM drbd0 5.16 0.00 85.15 16.49 1.30 244.81 191.23 98.73<br class="">03:49:02 PM drbd0 6.88 0.00 97.98 14.23 1.30 193.08 145.69 100.29<br class="">03:50:01 PM drbd0 10.53 0.00 180.82 17.18 1.51 120.57 95.35 100.38<br class="">03:51:01 PM drbd0 6.20 0.00 96.33 15.55 1.47 279.11 161.82 100.26<br class="">03:52:01 PM drbd0 9.71 0.00 183.82 18.92 1.36 118.30 103.47 100.51<br class="">03:53:01 PM drbd0 2.43 0.00 37.73 15.56 1.94 232.62 414.01 100.40<br class="">03:54:01 PM drbd0 12.55 0.00 398.39 31.75 8.01 736.01 69.56 87.28<br class="">03:55:01 PM drbd0 3.06 0.00 29.31 9.57 128.10 675.81 328.01 100.43<br class="">03:56:01 PM drbd0 1.34 0.00 39.36 29.40 43.34 126599.23 750.34 100.45<br class="">03:57:01 PM drbd0 5.36 0.00 128.97 24.05 1.60 258.42 186.98 100.27<br class="">03:58:01 PM drbd0 14.02 0.00 287.79 20.53 3.30 264.49 71.71 100.52<br class="">03:59:01 PM drbd0 23.30 0.00 654.19 28.08 2.32 103.39 42.16 98.23<br class="">04:00:01 PM drbd0 40.33 0.00 1390.08 34.47 3.56 88.24 19.73 79.58<br class="">04:01:01 PM drbd0 37.54 0.00 1225.94 32.66 3.62 96.50 22.01 82.64<br class="">04:02:01 PM drbd0 47.52 0.00 1602.41 33.72 2.41 50.75 8.77 41.68<br class="">04:03:01 PM drbd0 60.49 0.00 2000.13 33.07 1.72 28.36 3.88 23.45<br class="">04:04:01 PM drbd0 67.02 0.00 2297.37 34.28 1.01 15.12 1.40 9.39<br class="">04:05:01 PM drbd0 59.32 0.00 1996.30 33.65 0.89 15.09 1.82 10.77<br class="">04:06:01 PM drbd0 58.68 0.00 2023.81 34.49 1.55 26.38 3.58 20.99<br class="">04:07:01 PM drbd0 57.96 0.00 1989.85 34.33 0.98 16.98 1.71 9.92<br class="">04:08:01 PM drbd0 55.01 0.00 1872.40 34.03 0.99 17.99 2.14 11.80<br class=""></font><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class="">Any help is greatly appreciated.</div><div class=""><br class=""></div><div class="">Best regards,<br class=""><div class="">
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; " class=""><div style="color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; " class="">--</div><div style="color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; " class="">Louis Munro<br class=""><a href="mailto:lmunro@inverse.ca" class="">lmunro@inverse.ca</a> :: <a href="http://www.inverse.ca" class="">www.inverse.ca</a> <br class="">+1.514.447.4918 x125 :: +1 (866) 353-6153 x125<br class="">Inverse inc. :: Leaders behind SOGo (<a href="http://www.sogo.nu" class="">www.sogo.nu</a>) and PacketFence (<a href="http://www.packetfence.org" class="">www.packetfence.org</a>)</div></div>
</div>
<br class=""></div></body></html>