Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello
I found the following knowledge base article from 2011 which described my
exact problem: http://www.novell.com/support/kb/doc.php?id=7009306
The solution given there was to switch from a static to an adaptive sync
rate, especially when using very fast (gigabit) network interface cards.
For me, it seemed to work when switching from "rate 40M" to:
syncer {
    c-plan-ahead  20;
    c-min-rate    1M;
    c-max-rate    300M;
    c-fill-target 2M;
    verify-alg    md5;
}
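
(For what it's worth, such a change to the syncer section can apparently be
applied to the running resource without restarting DRBD, roughly like this;
"drbd0" is the resource name from the config quoted below:)

  drbdadm adjust drbd0    # re-read the config and apply the changed settings
  cat /proc/drbd          # check connection and sync state afterwards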
Before that, I also tried using a plain cross-over interface instead of the
bonded one, but that had no effect. Adjusting the other values recommended
in the KB article worked for me as well, but I changed them back to their
defaults to isolate the settings above as the real fix.
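
(To double-check which values are actually active on the device, as opposed
to what the config file says, something like this should do on our 8.3:)

  drbdsetup /dev/drbd0 show    # dump the settings of the running device
  drbdadm dump drbd0           # dump the parsed configuration file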
Any comments on this one? Is this a bug in DRBD?
Best regards
-christian-
On Mon, 6 Jan 2014 12:49:31 +0100
Christian Hammers <chammers at netcologne.de> wrote:
> Hello
>
> DRBD stalls reproducibly whenever I do a "drbdadm verify". It runs for
> a couple of minutes and then suddenly seems to lose its connection:
>
> Jan 6 10:52:54 vm-office03 kernel: block drbd0: conn( Connected -> VerifyS )
> Jan 6 10:52:54 vm-office03 kernel: block drbd0: Starting Online Verify from sector 813355536
> Jan 6 10:52:54 vm-office03 kernel: block drbd0: Out of sync: start=813367040, size=1280 (sectors)
> Jan 6 10:52:56 vm-office03 kernel: block drbd0: Out of sync: start=813526920, size=376 (sectors)
> Jan 6 10:52:57 vm-office03 kernel: block drbd0: Out of sync: start=813582592, size=472 (sectors)
> ...
> Jan 6 10:58:00 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967295
> Jan 6 10:58:06 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967294
> Jan 6 10:58:12 vm-office03 pvestatd[14442]: WARNING: command 'df -P -B 1 /mnt/pve/nfs-store1' failed: got timeout
> Jan 6 10:58:12 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967293
> Jan 6 10:58:18 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967292
> Jan 6 10:58:24 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967291
> Jan 6 10:58:30 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967290
> Jan 6 10:58:36 vm-office03 kernel: block drbd0: [drbd0_worker/3154] sock_sendmsg time expired, ko = 4294967289
> ...
>
> (I noticed that the ko value starts at 2^32-1 and counts down; what does it mean?)
>
> Any ideas what might cause this? There are no further kernel messages nor
> can I see any interface errors on the two "e1000e" physical interfaces
> that are bonded together for the 10.111.222.0/24 net.
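>
> (For reference, this is the kind of check meant here, done with the usual
> tools; "bond0", "eth0" and "eth1" are only placeholders for the actual
> interface names:)
>
>   cat /proc/net/bonding/bond0     # bonding state and link failure count per slave
>   ethtool -S eth0 | grep -i err   # NIC error counters of one slave
>   ip -s link show eth1            # RX/TX errors and drops of the other slave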
>
> I successfully ran "drbdadm invalidate" to re-sync the cluster several
> times and never experienced problems during normal usage (with very light
> disk I/O, though).
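>
> (For reference, the re-sync was triggered roughly like this, run on the
> node whose copy is to be thrown away and rebuilt from the peer:)
>
>   drbdadm invalidate drbd0    # mark the local data inconsistent, forces a full resync
>   watch -n5 cat /proc/drbd    # follow the resync progress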
>
> My setup consists of two Fujitsu Siemens RX200 S4 servers which run DRBD
> in primary/primary mode for a "Proxmox VE" cluster. The two servers are
> connected via a cross-over cable, and only one of them actually mounts the
> DRBD device and exports it as an NFS volume. All access to the device is
> thus made via NFS through the current NFS master node.
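>
> (That both nodes are primary can be seen in /proc/drbd, e.g.:)
>
>   cat /proc/drbd    # healthy state shows cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate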
>
> The nodes are running Kernel 2.6.32-26-pve with DRBD 8.3.13.
>
> The DRBD config is as follows:
>
> root at vm-office03:/home/chammers# egrep -v '^[[:space:]]*#' /etc/drbd.d/drbd0.res
> resource drbd0 {
>     protocol C;
>
>     startup {
>         become-primary-on both;
>     }
>
>     on vm-office03 {
>         device    /dev/drbd0;
>         disk      /dev/sda3;
>         address   10.111.222.1:7789;
>         meta-disk internal;
>     }
>
>     on vm-office04 {
>         device    /dev/drbd0;
>         disk      /dev/sda3;
>         address   10.111.222.2:7789;
>         meta-disk internal;
>     }
>
>     disk {
>         no-disk-barrier;
>     }
>
>     net {
>         cram-hmac-alg sha1;
>         shared-secret "drbd0proxmox";
>         data-integrity-alg crc32c;
>         allow-two-primaries;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>
>     syncer {
>         rate 40M;
>         csums-alg crc32c;
>         verify-alg md5;
>     }
> }
>
> root at vm-office04:/home/chammers# grep -v '^[[:space:]]*#' /etc/drbd.d/drbd0.res
> resource drbd0 {
>     protocol C;
>
>     startup {
>         become-primary-on both;
>     }
>
>     on vm-office03 {
>         device    /dev/drbd0;
>         disk      /dev/sda3;
>         address   10.111.222.1:7789;
>         meta-disk internal;
>     }
>
>     on vm-office04 {
>         device    /dev/drbd0;
>         disk      /dev/sda3;
>         address   10.111.222.2:7789;
>         meta-disk internal;
>     }
>
>     disk {
>         no-disk-barrier;
>     }
>
>     net {
>         cram-hmac-alg sha1;
>         shared-secret "drbd0proxmox";
>         data-integrity-alg crc32c;
>         allow-two-primaries;
>         after-sb-0pri discard-zero-changes;
>         after-sb-1pri discard-secondary;
>         after-sb-2pri disconnect;
>     }
>
>     syncer {
>         rate 40M;
>         csums-alg crc32c;
>         verify-alg md5;
>     }
> }
>
> Best Regards
>
> -christian-