[DRBD-user] Unexpected disconnects followed by drbd-proxy segfault

Tue Feb 2 07:44:46 CET 2010

I just rolled out our 3-node config with one node off-site connected
via drbd-proxy.  The two sites are each behind firewalls, so the
remote connection is via an openvpn tunnel.  Typical latency through
the tunnel is 2-3 msec.  A couple hours after the initial resync, the
primary logged:

Feb  1 23:40:51 axion kernel: [244623.159791] block drbd10: PingAck
did not arrive in time.
Feb  1 23:40:51 axion kernel: [244623.159846] block drbd10: peer(
Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk(
UpToDate -> DUnknown )

It then began a resync.  About an hour into the resync, another missed
PingAck was logged.  The following log entries then repeated several
times:

Feb  2 00:11:49 axion kernel: [246478.860538] block drbd10: short read
expecting header on sock: r=-512
Feb  2 00:11:49 axion kernel: [246478.860959] block drbd10: Connection closed
Feb  2 00:11:49 axion kernel: [246478.860965] block drbd10: conn(
NetworkFailure -> Unconnected )
Feb  2 00:11:49 axion kernel: [246478.860969] block drbd10: receiver terminated
Feb  2 00:11:49 axion kernel: [246478.860970] block drbd10: Restarting
receiver thread
Feb  2 00:11:49 axion kernel: [246478.860973] block drbd10: receiver (re)started
Feb  2 00:11:49 axion kernel: [246478.860977] block drbd10: conn(
Unconnected -> WFConnection )
Feb  2 00:12:02 axion kernel: [246492.366630] block drbd10:
sock_recvmsg returned -11
Feb  2 00:12:02 axion kernel: [246492.366667] block drbd10: conn(
WFConnection -> BrokenPipe )
Feb  2 00:12:02 axion kernel: [246492.366674] block drbd10: short read
expecting header on sock: r=-11

Two minutes later, the remote node's drbd-proxy segfaulted:

Feb  2 00:13:47 axino kernel: [369993.255858] drbd-proxy[9445]:
segfault at 0 rip 403cd4 rsp 7fffff745860 error 4

How high can I safely increase ping-timeout on the drbd10 resource?
It would certainly *seem* that 500ms should be fine, but something is
clearly going wrong.
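
For reference, the change I have in mind would look roughly like the
fragment below.  This is only a sketch: it assumes ping-timeout goes in
the net section of the stacked resource and is given in tenths of a
second (so the default of 5 corresponds to the 500ms above).  The value
20 is purely illustrative, not a recommendation, and the rest of the
resource would stay as in the full config.

resource dr-www0 {
    net {
        # assumed unit: tenths of a second, so 20 = 2 seconds
        # (illustrative value only)
        ping-timeout 20;
    }
}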

NICs are the standard Dell Broadcom, running the bnx2 driver v. 1.6.9.

Here's my drbd.conf:

global {
    usage-count yes;
}

resource www0 {
    protocol C;
    device     /dev/drbd0;
    meta-disk  internal;

    syncer { rate 90M; }

    net {
        cram-hmac-alg md5;
        shared-secret "kegHighOwn9OdvicJankapjegEmtOb";
    }

    on axion {
        disk       /dev/vg0/www_data;
        address    10.0.0.17:7788;
    }

    on hyperaxe {
        disk      /dev/vgdata/www_data;
        address   10.0.0.19:7788;
    }
}

resource dr-www0 {
    protocol A;

    syncer {
        csums-alg md5;
        rate 100K;
        after www0;
        use-rle;
    }

    proxy {
        compression on;
        memlimit 500M;
    }

    stacked-on-top-of www0 {
        device /dev/drbd10;
        address 127.0.0.1:7789;
        proxy on axion hyperaxe {
            inside 127.0.0.1:7788;
            outside 172.20.0.1:7788;
        }
    }

    on axino {
        device /dev/drbd0;
        disk /dev/vgdata/www_data;
        address 127.0.0.1:7789;
        meta-disk internal;
        proxy on axino {
            inside 127.0.0.1:7788;
            outside 172.20.0.2:7788;
        }
    }
}

(Yes, I know I'm missing cram-hmac-alg and shared-secret in the
stacked resource.  I'll be fixing that shortly.)
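
The fix would be something along these lines, mirroring the net section
of www0.  Again just a sketch: the secret shown is a placeholder, not a
real value, and the rest of dr-www0 stays as above.

resource dr-www0 {
    net {
        # same HMAC algorithm as the www0 resource
        cram-hmac-alg md5;
        # placeholder; replace with a real secret shared by both ends
        shared-secret "REPLACE-WITH-REAL-SECRET";
    }
}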