Hello,

DRBD stalls reproducibly whenever I do a "drbdadm verify". It runs for a couple of minutes and then suddenly seems to lose its connection:

Jan  6 10:52:54 vm-office03 kernel: block drbd0: conn( Connected -> VerifyS )
Jan  6 10:52:54 vm-office03 kernel: block drbd0: Starting Online Verify from sector 813355536
Jan  6 10:52:54 vm-office03 kernel: block drbd0: Out of sync: start=813367040, size=1280 (sectors)
Jan  6 10:52:56 vm-office03 kernel: block drbd0: Out of sync: start=813526920, size=376 (sectors)
Jan  6 10:52:57 vm-office03 kernel: block drbd0: Out of sync: start=813582592, size=472 (sectors)
...
Jan  6 10:58:00 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967295
Jan  6 10:58:06 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967294
Jan  6 10:58:12 vm-office03 pvestatd[14442]: WARNING: command 'df -P -B 1 /mnt/pve/nfs-store1' failed: got timeout
Jan  6 10:58:12 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967293
Jan  6 10:58:18 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967292
Jan  6 10:58:24 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967291
Jan  6 10:58:30 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967290
Jan  6 10:58:36 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967289
...

(I noticed that the initial ko value is 2^32-1; what does it mean?)

Any ideas what might cause this? There are no further kernel messages, nor can I see any interface errors on the two "e1000e" physical interfaces that are bonded together for the 10.111.222.0/24 net. I have successfully run "drbdadm invalidate" to re-sync the cluster several times and have never experienced problems during normal usage (though with very light disk I/O).
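(Regarding the ko value: 4294967295 is 2^32-1, i.e. the bit pattern of -1 held in an unsigned 32-bit counter, and the log shows it counting down by one with each "sock_sendmsg time expired" message. A minimal shell sketch of that wraparound arithmetic only, not of DRBD's internal logic:

```shell
# 4294967295 = 2^32 - 1: decrementing an unsigned 32-bit counter that
# holds 0 wraps around to this value. Bash uses 64-bit arithmetic, so a
# mask emulates the 32-bit wrap.
ko=0
ko=$(( (ko - 1) & 0xFFFFFFFF ))    # wraps to 4294967295
echo "$ko"
ko=$(( (ko - 1) & 0xFFFFFFFF ))    # next timeout: 4294967294
echo "$ko"
```

So the sequence in the log looks like a counter that started at 0 and is decremented on every send timeout.)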
My setup consists of two Fujitsu Siemens RX200 S4 servers which run DRBD in primary/primary mode for a "Proxmox VE" cluster. The two servers are connected via cross-over cable, and only one of them actually mounts the drbd device and exports it as an NFS volume. All access to the device is thus made via NFS through the current NFS master node. The nodes are running kernel 2.6.32-26-pve with DRBD 8.3.13.

The DRBD config is as follows:

root@vm-office03:/home/chammers# egrep -v '^[[:space:]]*#' /etc/drbd.d/drbd0.res
resource drbd0 {
    protocol C;

    startup {
        become-primary-on both;
    }

    on vm-office03 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.111.222.1:7789;
        meta-disk internal;
    }

    on vm-office04 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.111.222.2:7789;
        meta-disk internal;
    }

    disk {
        no-disk-barrier;
    }

    net {
        cram-hmac-alg sha1;
        shared-secret "drbd0proxmox";
        data-integrity-alg crc32c;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    syncer {
        rate 40M;
        csums-alg crc32c;
        verify-alg md5;
    }
}

root@vm-office04:/home/chammers# grep -v '^[[:space:]]*#' /etc/drbd.d/drbd0.res
resource drbd0 {
    protocol C;

    startup {
        become-primary-on both;
    }

    on vm-office03 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.111.222.1:7789;
        meta-disk internal;
    }

    on vm-office04 {
        device    /dev/drbd0;
        disk      /dev/sda3;
        address   10.111.222.2:7789;
        meta-disk internal;
    }

    disk {
        no-disk-barrier;
    }

    net {
        cram-hmac-alg sha1;
        shared-secret "drbd0proxmox";
        data-integrity-alg crc32c;
        allow-two-primaries;
        after-sb-0pri discard-zero-changes;
        after-sb-1pri discard-secondary;
        after-sb-2pri disconnect;
    }

    syncer {
        rate 40M;
        csums-alg crc32c;
        verify-alg md5;
    }
}

Best Regards
-christian-