Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
I'm running an older version of drbd (0.6.12). I am planning on upgrading to
0.7.20 this week. However, I'm not sure if that will necessarily solve my
problem. I have three devices and the third one fails to sync.
Filesystem      Size  Used Avail Use% Mounted on
/dev/nb0        105G   83G   17G  83% /mirror    (partition 1 on drive 1)
/dev/nb1        113G   93G   15G  86% /nbu       (partition 2 on drive 1)
/dev/nb2        688G  6.1G  646G   1% /mbu       (partition 1 on drive 2)
0: cs:SyncingAll st:Primary/Secondary ns:9832972 nr:0 dw:39756 dr:9821117 pe:41 ua:0
   [=>..................] sync'ed: 8.8% (99433/109003)M
   finish: 7:40:03h speed: 3,696 (2,738) K/sec
1: cs:SyncingAll st:Primary/Secondary ns:9816424 nr:0 dw:9448 dr:9835789 pe:39 ua:0
   [=>..................] sync'ed: 8.2% (107782/117368)M
   finish: 8:27:08h speed: 3,631 (2,743) K/sec
2: cs:BrokenPipe st:Primary/Secondary ns:5521284 nr:0 dw:3352 dr:5525309 pe:7085 ua:0
   NEEDS_SYNC
All three devices sync through the same 100 Mbit pipe. I'm guessing the third
one, /dev/nb2, is timing out because of too much traffic, which would explain
why its status shows 'BrokenPipe'. Is this a reasonable scenario?
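Here's the rough math behind that guess (assuming the link really tops out
around 100 Mbit/s; the other numbers come from my config and the status above):

    100 Mbit/s link         ~ 12 MB/s raw, shared by all three resources
    devices 0 + 1 syncing   ~ 3.7 + 3.6 = ~7.3 MB/s already in use
    sync-max = 12M          per resource, so the syncers alone could try to
                            push far more than the link can carry
    timeout = 60 (0.1 s)    = 6 seconds before a slow ack counts as a timeout

And the pe:7085 on device 2 (versus 41 and 39 on the other two) looks to me
like requests piling up behind the saturated pipe.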
If so, then putting a second network card in both nodes should keep this from
happening. Or have I just hit a limitation in 0.6.12 that upgrading will fix?
Since I'm planning on upgrading anyway, what I'd really like to know is whether
it's necessary to purchase two more network cards. (At the very bottom, after
my current config, is a sketch of how I imagine drbd2 would look on a dedicated
second link.)
Thanks for your help,
Tom
PS: Here's my drbd.conf
resource drbd0 {
  protocol = C
  fsckcmd = /bin/true
  inittimeout = 120
  skip-wait

  disk {
    do-panic
    disk-size = 111619584k
  }

  net {
    sync-min    = 999k
    sync-max    = 12M   # maximal average syncer bandwidth
    tl-size     = 8000  # transfer log size, ensures strict write ordering
    timeout     = 60    # unit: 0.1 seconds
    connect-int = 10    # unit: seconds
    ping-int    = 10    # unit: seconds
    ko-count    = 4     # if some block send times out this many times,
                        # the peer is considered dead, even if it still
                        # answers ping requests
  }

  on zan {
    device  = /dev/nb0
    disk    = /dev/hdb1
    address = 192.168.1.3
    port    = 7788
  }

  on jayna {
    device  = /dev/nb0
    disk    = /dev/hdb1
    address = 192.168.1.4
    port    = 7788
  }
}
resource drbd1 {
  protocol = C
  fsckcmd = /bin/true
  inittimeout = 120
  skip-wait

  disk {
    do-panic
    disk-size = 120185300k
  }

  net {
    sync-min    = 999k
    sync-max    = 12M   # maximal average syncer bandwidth
    tl-size     = 8000  # transfer log size, ensures strict write ordering
    timeout     = 60    # unit: 0.1 seconds
    connect-int = 10    # unit: seconds
    ping-int    = 10    # unit: seconds
    ko-count    = 4     # if some block send times out this many times,
                        # the peer is considered dead, even if it still
                        # answers ping requests
  }

  on zan {
    device  = /dev/nb1
    disk    = /dev/hdb2
    address = 192.168.1.3
    port    = 7789
  }

  on jayna {
    device  = /dev/nb1
    disk    = /dev/hdb2
    address = 192.168.1.4
    port    = 7789
  }
}
resource drbd2 {
  protocol = C
  fsckcmd = /bin/true
  inittimeout = 120
  skip-wait

  disk {
    do-panic
    disk-size = 732572000k
  }

  net {
    sync-min    = 999k
    sync-max    = 12M   # maximal average syncer bandwidth
    tl-size     = 8000  # transfer log size, ensures strict write ordering
    timeout     = 60    # unit: 0.1 seconds
    connect-int = 10    # unit: seconds
    ping-int    = 10    # unit: seconds
    ko-count    = 4     # if some block send times out this many times,
                        # the peer is considered dead, even if it still
                        # answers ping requests
  }

  on zan {
    device  = /dev/nb2
    disk    = /dev/hdd1
    address = 192.168.1.3
    port    = 7790
  }

  on jayna {
    device  = /dev/nb2
    disk    = /dev/hdd1
    address = 192.168.1.4
    port    = 7790
  }
}
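And for what it's worth, if I do buy the second pair of cards, this is roughly
how I picture drbd2 afterwards: only the two host sections would change, with
the address lines pointing at the new NICs on their own subnet. The 192.168.2.x
addresses below are just made up for illustration.

  # resource drbd2, host sections only -- everything else stays as above
  on zan {
    device  = /dev/nb2
    disk    = /dev/hdd1
    address = 192.168.2.3   # second NIC on zan (hypothetical address)
    port    = 7790
  }

  on jayna {
    device  = /dev/nb2
    disk    = /dev/hdd1
    address = 192.168.2.4   # second NIC on jayna (hypothetical address)
    port    = 7790
  }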