I just rolled out our 3-node config with one node off-site, connected via drbd-proxy. The two sites are each behind firewalls, so the remote connection runs through an OpenVPN tunnel. Typical latency through the tunnel is 2-3 msec.

A couple of hours after the initial resync, the primary logged:

Feb 1 23:40:51 axion kernel: [244623.159791] block drbd10: PingAck did not arrive in time.
Feb 1 23:40:51 axion kernel: [244623.159846] block drbd10: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )

and began a resync. About an hour into the resync, another missed PingAck was logged. These log entries repeated several times:

Feb 2 00:11:49 axion kernel: [246478.860538] block drbd10: short read expecting header on sock: r=-512
Feb 2 00:11:49 axion kernel: [246478.860959] block drbd10: Connection closed
Feb 2 00:11:49 axion kernel: [246478.860965] block drbd10: conn( NetworkFailure -> Unconnected )
Feb 2 00:11:49 axion kernel: [246478.860969] block drbd10: receiver terminated
Feb 2 00:11:49 axion kernel: [246478.860970] block drbd10: Restarting receiver thread
Feb 2 00:11:49 axion kernel: [246478.860973] block drbd10: receiver (re)started
Feb 2 00:11:49 axion kernel: [246478.860977] block drbd10: conn( Unconnected -> WFConnection )
Feb 2 00:12:02 axion kernel: [246492.366630] block drbd10: sock_recvmsg returned -11
Feb 2 00:12:02 axion kernel: [246492.366667] block drbd10: conn( WFConnection -> BrokenPipe )
Feb 2 00:12:02 axion kernel: [246492.366674] block drbd10: short read expecting header on sock: r=-11

Two minutes later, the remote node's drbd-proxy segfaulted:

Feb 2 00:13:47 axino kernel: [369993.255858] drbd-proxy[9445]: segfault at 0 rip 403cd4 rsp 7fffff745860 error 4

How high can I safely increase ping-timeout on the drbd10 resource? It would certainly *seem* that 500ms should be fine, but something.

NICs are the standard Dell Broadcom, running the bnx2 driver v. 1.6.9.
Here's my drbd.conf:

global { usage-count yes; }

resource www0 {
  protocol C;
  device /dev/drbd0;
  meta-disk internal;
  syncer { rate 90M; }
  net {
    cram-hmac-alg md5;
    shared-secret "kegHighOwn9OdvicJankapjegEmtOb";
  }
  on axion {
    disk /dev/vg0/www_data;
    address 10.0.0.17:7788;
  }
  on hyperaxe {
    disk /dev/vgdata/www_data;
    address 10.0.0.19:7788;
  }
}

resource dr-www0 {
  protocol A;
  syncer {
    csums-alg md5;
    rate 100K;
    after www0;
    use-rle;
  }
  proxy {
    compression on;
    memlimit 500M;
  }
  stacked-on-top-of www0 {
    device /dev/drbd10;
    address 127.0.0.1:7789;
    proxy on axion hyperaxe {
      inside 127.0.0.1:7788;
      outside 172.20.0.1:7788;
    }
  }
  on axino {
    device /dev/drbd0;
    disk /dev/vgdata/www_data;
    address 127.0.0.1:7789;
    meta-disk internal;
    proxy on axino {
      inside 127.0.0.1:7788;
      outside 172.20.0.2:7788;
    }
  }
}

(Yes, I know I'm missing cram-hmac-alg and shared-secret in the stacked resource. I'll be fixing that shortly.)
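
For reference, the change I'm asking about would go in a net section of the stacked resource, roughly as sketched below. This is only a sketch under assumptions: as I read the drbd.conf man page, ping-timeout is given in tenths of a second (default 5, which is the 500ms mentioned above) and ping-int in seconds (default 10). The value 30 is purely illustrative, not a tested or recommended setting.

```
resource dr-www0 {
  net {
    # ping-timeout is in tenths of a second; the default of 5 is 500ms
    ping-timeout 30;   # illustrative value only (3 seconds)
    # ping-int is in seconds; 10 is the default, shown for context
    ping-int 10;
  }
  # ... rest of the resource unchanged ...
}
```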