[DRBD-user] performance issues during a resync

Tue Jun 28 09:00:58 CEST 2016

Protocol A is defined like this:

Protocol A. Asynchronous replication protocol. Local write operations on
the primary node are considered completed as soon as the local disk
write has finished, and the replication packet has been placed in the
local TCP send buffer.

- https://www.drbd.org/en/doc/users-guide-84/s-replication-protocols

You probably want to look at DRBD proxy:

https://www.drbd.org/en/doc/users-guide-84/s-drbd-proxy

digimer

On 27/06/16 12:18 PM, Louis Munro wrote:
> Hello,
> 
> I manage a pair of servers in two separate datacenters that replicate a
> device (backing a MySQL database) between the two.
> 
> Knowing that there was a storage issue in one of the datacenters that
> meant that I/O was sometimes blocking for a few minutes, I set the
> replication protocol to A.
> That way (or so I believed), when the storage felt like acting up in our
> secondary datacenter the worse that could happen was that replication
> would lag, or even fail for a few minutes and then reconnect and catch up.
> 
> Last Thursday we experienced a situation when the storage problem did
> crop up, drbd connection between the two DCs was broken and then
> replication restarted.
> The problem I faced was that during that replication, I/O latency seems
> to have been downgraded on both primary and secondary servers, just like
> I would have expected (and had previously experienced) using protocol C.
> Note that I/O was still slowed down in my secondary DC while the resync
> was taking place.
> 
> It’s as though, while replication is asynchronous, resynchronizations
> are actually synchronous.
> Am I missing something? 
> How can I avoid resyncs dragging down my I/O latency on the primary if
> the secondary is dragging it’s feet? 
> 
> Some details and logs follow.
> 
> 
> This is a pair of CentOS 6 servers, running drbd 8.4.5 on
> a 2.6.32-573.18.1 kernel.
> 
> 
> I am running with the following common configuration: 
> 
> common {
>     protocol A;
> 
>     handlers {
> 
>     }
> 
>     startup {
>         degr-wfc-timeout 120;
>     }
> 
>     syncer {
>         rate 500M;
>         al-extents 257;
>     }
>     options {
>     }
> 
>     disk {
>         on-io-error     detach;
>     }
> 
>     net {
> verify-alg md5;
>     }
> }
> 
> 
> This is what I have in the logs for the primary server: 
> 
> Jun 23 15:53:18 pf-db1 kernel: block drbd0: Remote failed to finish a
> request within ko-count * timeout
> Jun 23 15:53:18 pf-db1 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: asender terminated
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Terminating drbd_a_mysql
> Jun 23 15:53:18 pf-db1 kernel: block drbd0: new current UUID
> C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Connection closed
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Timeout -> Unconnected )
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver terminated
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: Restarting receiver thread
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: receiver (re)started
> Jun 23 15:53:18 pf-db1 kernel: drbd mysql: conn( Unconnected ->
> WFConnection )
> Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Handshake successful: Agreed
> network protocol version 101
> Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Agreed to support TRIM on
> protocol level
> Jun 23 15:53:22 pf-db1 kernel: drbd mysql: conn( WFConnection ->
> WFReportParams )
> Jun 23 15:53:22 pf-db1 kernel: drbd mysql: Starting asender thread (from
> drbd_r_mysql [20164])
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: drbd_sync_handshake:
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: self
> C7BC9B26D8B411FB:5386F0E7BDD5DE27:A561B9F4409AEACD:A560B9F4409AEACD
> bits:1151 flags:0
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer
> 5386F0E7BDD5DE26:0000000000000000:A561B9F4409AEACC:A560B9F4409AEACD
> bits:0 flags:0
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: uuid_compare()=1 by rule 70
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> Consistent )
> Jun 23 15:53:22 pf-db1 kernel: block drbd0: send bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%
> Jun 23 15:53:25 pf-db1 kernel: block drbd0: receive bitmap stats
> [Bytes(packets)]: plain 0(0), RLE 459(1), total 459; compression: 100.0%
> Jun 23 15:53:25 pf-db1 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-source minor-0
> Jun 23 15:53:26 pf-db1 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
> Jun 23 15:53:26 pf-db1 kernel: block drbd0: conn( WFBitMapS ->
> SyncSource ) pdsk( Consistent -> Inconsistent )
> Jun 23 15:53:26 pf-db1 kernel: block drbd0: Began resync as SyncSource
> (will sync 6952 KB [1738 bits set]).
> Jun 23 15:53:26 pf-db1 kernel: block drbd0: updated sync UUID
> C7BC9B26D8B411FB:5387F0E7BDD5DE27:5386F0E7BDD5DE27:A561B9F4409AEACD
> Jun 23 15:53:38 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 6
> Jun 23 15:54:27 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 6
> Jun 23 15:54:33 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 5
> Jun 23 15:54:47 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 4
> Jun 23 15:55:03 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 6
> Jun 23 15:55:09 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 5
> Jun 23 15:55:15 pf-db1 kernel: drbd mysql: [drbd_w_mysql/7339]
> sock_sendmsg time expired, ko = 4
> Jun 23 15:58:37 pf-db1 kernel: block drbd0: Resync done (total 310 sec;
> paused 0 sec; 20 K/sec)
> Jun 23 15:58:37 pf-db1 kernel: block drbd0: updated UUIDs
> C7BC9B26D8B411FB:0000000000000000:5387F0E7BDD5DE27:5386F0E7BDD5DE27
> Jun 23 15:58:37 pf-db1 kernel: block drbd0: conn( SyncSource ->
> Connected ) pdsk( Inconsistent -> UpToDate )
> 
> 
> This is sar reporting on block devices while the problem was happening: 
> 
> # sar -f /var/log/sa/sa23 -dp -s 15:43:00 -e 16:09:00 | awk '$3 ~ /DEV/
> || $3 ~ /drbd0/'
> 03:43:01 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz
>     await     svctm     %util
> 03:44:01 PM     drbd0     51.60      0.00   1601.61     31.04      0.04
>      0.83      0.19      0.99
> 03:45:01 PM     drbd0     21.44      0.00    571.93     26.68      1.14
>     51.26     40.69     87.25
> 03:46:01 PM     drbd0     17.72      0.00    406.49     22.94      1.20
>     65.60     53.99     95.65
> 03:47:01 PM     drbd0     13.04      0.00    217.15     16.66      1.43
>    113.87     76.94    100.31
> 03:48:01 PM     drbd0      5.16      0.00     85.15     16.49      1.30
>    244.81    191.23     98.73
> 03:49:02 PM     drbd0      6.88      0.00     97.98     14.23      1.30
>    193.08    145.69    100.29
> 03:50:01 PM     drbd0     10.53      0.00    180.82     17.18      1.51
>    120.57     95.35    100.38
> 03:51:01 PM     drbd0      6.20      0.00     96.33     15.55      1.47
>    279.11    161.82    100.26
> 03:52:01 PM     drbd0      9.71      0.00    183.82     18.92      1.36
>    118.30    103.47    100.51
> 03:53:01 PM     drbd0      2.43      0.00     37.73     15.56      1.94
>    232.62    414.01    100.40
> 03:54:01 PM     drbd0     12.55      0.00    398.39     31.75      8.01
>    736.01     69.56     87.28
> 03:55:01 PM     drbd0      3.06      0.00     29.31      9.57    128.10
>    675.81    328.01    100.43
> 03:56:01 PM     drbd0      1.34      0.00     39.36     29.40     43.34
> 126599.23    750.34    100.45
> 03:57:01 PM     drbd0      5.36      0.00    128.97     24.05      1.60
>    258.42    186.98    100.27
> 03:58:01 PM     drbd0     14.02      0.00    287.79     20.53      3.30
>    264.49     71.71    100.52
> 03:59:01 PM     drbd0     23.30      0.00    654.19     28.08      2.32
>    103.39     42.16     98.23
> 04:00:01 PM     drbd0     40.33      0.00   1390.08     34.47      3.56
>     88.24     19.73     79.58
> 04:01:01 PM     drbd0     37.54      0.00   1225.94     32.66      3.62
>     96.50     22.01     82.64
> 04:02:01 PM     drbd0     47.52      0.00   1602.41     33.72      2.41
>     50.75      8.77     41.68
> 04:03:01 PM     drbd0     60.49      0.00   2000.13     33.07      1.72
>     28.36      3.88     23.45
> 04:04:01 PM     drbd0     67.02      0.00   2297.37     34.28      1.01
>     15.12      1.40      9.39
> 04:05:01 PM     drbd0     59.32      0.00   1996.30     33.65      0.89
>     15.09      1.82     10.77
> 04:06:01 PM     drbd0     58.68      0.00   2023.81     34.49      1.55
>     26.38      3.58     20.99
> 04:07:01 PM     drbd0     57.96      0.00   1989.85     34.33      0.98
>     16.98      1.71      9.92
> 04:08:01 PM     drbd0     55.01      0.00   1872.40     34.03      0.99
>     17.99      2.14     11.80
> 
> 
> 
> Any help is greatly appreciated.
> 
> Best regards,
> --
> Louis Munro
> lmunro at inverse.ca <mailto:lmunro at inverse.ca>  ::  www.inverse.ca
> <http://www.inverse.ca> 
> +1.514.447.4918 x125  :: +1 (866) 353-6153 x125
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu <http://www.sogo.nu>)
> and PacketFence (www.packetfence.org <http://www.packetfence.org>)
> 
> 
> 
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
> 

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?