I just rolled out our 3-node config with one node off-site, connected
via drbd-proxy. The two sites are each behind firewalls, so the
remote connection goes through an OpenVPN tunnel. Typical latency through
the tunnel is 2-3 ms. A couple of hours after the initial resync, the
primary logged:
Feb 1 23:40:51 axion kernel: [244623.159791] block drbd10: PingAck did not arrive in time.
Feb 1 23:40:51 axion kernel: [244623.159846] block drbd10: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
The primary then began a resync. About an hour into the resync, another
missed PingAck was logged. These log entries repeated several times:
Feb 2 00:11:49 axion kernel: [246478.860538] block drbd10: short read expecting header on sock: r=-512
Feb 2 00:11:49 axion kernel: [246478.860959] block drbd10: Connection closed
Feb 2 00:11:49 axion kernel: [246478.860965] block drbd10: conn( NetworkFailure -> Unconnected )
Feb 2 00:11:49 axion kernel: [246478.860969] block drbd10: receiver terminated
Feb 2 00:11:49 axion kernel: [246478.860970] block drbd10: Restarting receiver thread
Feb 2 00:11:49 axion kernel: [246478.860973] block drbd10: receiver (re)started
Feb 2 00:11:49 axion kernel: [246478.860977] block drbd10: conn( Unconnected -> WFConnection )
Feb 2 00:12:02 axion kernel: [246492.366630] block drbd10: sock_recvmsg returned -11
Feb 2 00:12:02 axion kernel: [246492.366667] block drbd10: conn( WFConnection -> BrokenPipe )
Feb 2 00:12:02 axion kernel: [246492.366674] block drbd10: short read expecting header on sock: r=-11
Two minutes later, the remote node's drbd-proxy segfaulted:
Feb 2 00:13:47 axino kernel: [369993.255858] drbd-proxy[9445]: segfault at 0 rip 403cd4 rsp 7fffff745860 error 4
How high can I safely increase ping-timeout on the drbd10 resource?
It would certainly *seem* that 500 ms should be fine, but something is
evidently going wrong.
NICs are the standard Dell Broadcom, running the bnx2 driver v. 1.6.9.
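For reference, this is roughly where the setting would go. It is only a sketch: it assumes the unit documented in the drbd.conf man page (tenths of a second, so 5 = 500 ms), and the value 30 is an illustration, not a tested recommendation.

```
resource dr-www0 {
  net {
    # ping-timeout is given in tenths of a second.
    # 30 = 3 s (illustrative value only; 5 = the 500 ms mentioned above).
    ping-timeout 30;
  }
  # ... rest of the resource definition unchanged ...
}
```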
Here's my drbd.conf:
global {
  usage-count yes;
}
resource www0 {
  protocol C;
  device /dev/drbd0;
  meta-disk internal;
  syncer { rate 90M; }
  net {
    cram-hmac-alg md5;
    shared-secret "kegHighOwn9OdvicJankapjegEmtOb";
  }
  on axion {
    disk /dev/vg0/www_data;
    address 10.0.0.17:7788;
  }
  on hyperaxe {
    disk /dev/vgdata/www_data;
    address 10.0.0.19:7788;
  }
}
resource dr-www0 {
  protocol A;
  syncer {
    csums-alg md5;
    rate 100K;
    after www0;
    use-rle;
  }
  proxy {
    compression on;
    memlimit 500M;
  }
  stacked-on-top-of www0 {
    device /dev/drbd10;
    address 127.0.0.1:7789;
    proxy on axion hyperaxe {
      inside 127.0.0.1:7788;
      outside 172.20.0.1:7788;
    }
  }
  on axino {
    device /dev/drbd0;
    disk /dev/vgdata/www_data;
    address 127.0.0.1:7789;
    meta-disk internal;
    proxy on axino {
      inside 127.0.0.1:7788;
      outside 172.20.0.2:7788;
    }
  }
}
(Yes, I know I'm missing cram-hmac-alg and shared-secret in the
stacked resource. I'll be fixing that shortly.)
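
The planned fix would look something like the following. This is a sketch that simply mirrors the net section of www0; the secret shown is a placeholder, not a real value.

```
resource dr-www0 {
  net {
    # Same peer-authentication options that www0 already uses.
    cram-hmac-alg md5;
    shared-secret "REPLACE-WITH-A-REAL-SECRET";  # placeholder
    # Any other net options for this resource would go in this same block.
  }
  # ... rest of the resource definition unchanged ...
}
```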