[DRBD-user] [0.7.23] reconnect problem after link loss

Lukasz Engel lukasz.engel at softax.com.pl
Tue Apr 24 12:10:37 CEST 2007


I have 2 machines running drbd 0.7.23 (self-compiled) with 5 drbdX
resources configured (and heartbeat running on top).
drbd uses a direct cross-over cable for synchronization. Kernel 2.6.19.2
(vendor kernel - Trustix 3), UP.

Today I disconnected and reconnected the direct cable, and after that 2 of
the 5 drbd devices failed to reconnect:
drbd0,2,4 reconnected successfully
drbd1 on the secondary was stuck in the NetworkFailure state (WFConnection
on the primary)
drbd3 kept retrying to reconnect but never succeeded (it always went to
BrokenPipe after WFReportParams)

Running drbdadm down/up for both failed devices fixed it.
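
The down/up workaround can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names `restart_commands` and `restart` are hypothetical, the resource names (`www`, `dbdata`) are taken from the attached config, and the only commands used are `drbdadm down <resource>` and `drbdadm up <resource>`. By default it only prints what it would run.

```python
import subprocess


def restart_commands(resources):
    """Build the drbdadm down/up command pair for each resource name."""
    commands = []
    for name in resources:
        commands.append(["drbdadm", "down", name])
        commands.append(["drbdadm", "up", name])
    return commands


def restart(resources, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in restart_commands(resources):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command
            subprocess.run(cmd, check=True)
```

Calling `restart(["www"])` on one node prints the two commands without running them. In the scenario described here the two resources were restarted on different nodes, so such a helper would be called once per node with the resource that is stuck there.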

full scenario:

start    all drbdN are Connected Primary/Secondary (or Secondary/Primary)
11:20:18 link disconnected
11:21:51 link connected -> drbd0,2,4 reconnected, drbd1,3 didn't
11:24    heartbeat shut down on gauss1 (secondary for drbd1,3) (I wasn't
         sure whether I had to shut down drbd entirely on that node)
11:28    drbdadm down/up www (drbd1) on gauss1 -> after that drbd1 connected
11:29    drbdadm down/up dbdata (drbd3) on gauss2 -> after that drbd3 connected
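
For reference, the length of the outage in the timeline above can be worked out with a short Python sketch. It assumes both timestamps are from the same day, and it assumes the `net` values in the attached config use the units documented for drbd 0.7 (`timeout` in tenths of a second, `ping-int` in seconds). The helper name `outage_seconds` is hypothetical, and the detection estimate is a rough reading of those settings, not a statement about what drbd actually did.

```python
from datetime import datetime

FMT = "%H:%M:%S"


def outage_seconds(link_down, link_up):
    """Seconds between two same-day HH:MM:SS timestamps."""
    delta = datetime.strptime(link_up, FMT) - datetime.strptime(link_down, FMT)
    return delta.total_seconds()


outage = outage_seconds("11:20:18", "11:21:51")

# Values from the net section of the attached config (assumed units)
timeout_s = 60 / 10   # timeout 60 -> 6 seconds
ping_int_s = 10       # ping-int 10 -> 10 seconds

# Rough upper bound for noticing a dead peer on an idle link
detect_s = ping_int_s + timeout_s

print(outage)             # 93.0
print(outage > detect_s)  # True
```

Under these assumptions the cable was unplugged for 93 seconds, well beyond the roughly 16 seconds in which an idle connection should have been declared dead, so both nodes had time to notice the link loss before the cable was plugged back in.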

(I observed a similar problem some time ago, but it is not 100%
reproducible; I could not reproduce it a second time today.)

/proc/drbd from both machines (taken before heartbeat shutdown on gauss1):

root@gauss1 ~# cat /proc/drbd
version: 0.7.23 (api:79/proto:74)
SVN Revision: 2686 build by root@gauss1.softax.local, 2007-02-01 00:22:23
0: cs:Connected st:Primary/Secondary ld:Consistent
   ns:231268 nr:8 dw:231280 dr:3255883 al:1 bm:265 lo:0 pe:0 ua:0 ap:0
1: cs:NetworkFailure st:Secondary/Primary ld:Consistent
   ns:876 nr:1863628 dw:1864504 dr:1329 al:5 bm:645 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Primary/Secondary ld:Consistent
   ns:86870212 nr:199645112 dw:286551572 dr:1036641651 al:1186444 bm:2615 lo:0 pe:0 ua:0 ap:0
3: cs:BrokenPipe st:Secondary/Unknown ld:Consistent
   ns:16260 nr:33465888 dw:33482256 dr:80785 al:61 bm:1014 lo:0 pe:0 ua:0 ap:0
4: cs:Connected st:Secondary/Primary ld:Consistent
   ns:37 nr:1430 dw:1454 dr:1257 al:0 bm:93 lo:0 pe:0 ua:0 ap:0
-------------------
root@gauss2 ~# cat /proc/drbd
version: 0.7.23 (api:79/proto:74)
SVN Revision: 2686 build by root@gauss2.softax.local, 2007-01-31 17:12:23
0: cs:Connected st:Secondary/Primary ld:Consistent
   ns:0 nr:9820 dw:9820 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0
1: cs:WFConnection st:Primary/Unknown ld:Consistent
   ns:32900 nr:844 dw:33744 dr:259281 al:0 bm:5 lo:0 pe:0 ua:0 ap:0
2: cs:Connected st:Secondary/Primary ld:Consistent
   ns:0 nr:18973748 dw:18973748 dr:0 al:0 bm:94 lo:0 pe:0 ua:0 ap:0
3: cs:WFConnection st:Primary/Unknown ld:Consistent
   ns:3721668 nr:1880 dw:3725040 dr:390077 al:126 bm:0 lo:0 pe:0 ua:0 ap:0
4: cs:Connected st:Primary/Secondary ld:Consistent
   ns:6 nr:3 dw:9 dr:751 al:0 bm:0 lo:0 pe:0 ua:0 ap:0

---------------------
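
The per-device status lines in the /proc/drbd output above can be picked apart with a few lines of Python. This is a minimal sketch that assumes the 0.7-style line format shown here (`N: cs:... st:.../... ld:...`); the names `parse_proc_drbd` and `not_connected` are hypothetical, and treating every state other than `Connected` as a problem is a simplification, since resync states are normal too.

```python
import re

# Matches e.g. "1: cs:NetworkFailure st:Secondary/Primary ld:Consistent"
STATUS_RE = re.compile(r"^\s*(\d+):\s+cs:(\S+)\s+st:([^/\s]+)/(\S+)\s+ld:(\S+)")


def parse_proc_drbd(text):
    """Map device minor number -> connection state, roles and disk state."""
    devices = {}
    for line in text.splitlines():
        m = STATUS_RE.match(line)
        if m:  # counter lines (ns:, nr:, ...) and the header do not match
            minor, cs, local, peer, ld = m.groups()
            devices[int(minor)] = {"cs": cs, "local": local, "peer": peer, "ld": ld}
    return devices


def not_connected(devices):
    """Return {minor: cs} for every device whose state is not Connected."""
    return {minor: d["cs"] for minor, d in devices.items() if d["cs"] != "Connected"}


# Status lines copied from the gauss1 output (counter lines omitted)
GAUSS1 = """\
0: cs:Connected st:Primary/Secondary ld:Consistent
1: cs:NetworkFailure st:Secondary/Primary ld:Consistent
2: cs:Connected st:Primary/Secondary ld:Consistent
3: cs:BrokenPipe st:Secondary/Unknown ld:Consistent
4: cs:Connected st:Secondary/Primary ld:Consistent
"""

print(not_connected(parse_proc_drbd(GAUSS1)))
# {1: 'NetworkFailure', 3: 'BrokenPipe'}
```

Run against the gauss2 output, the same function would report devices 1 and 3 in `WFConnection`, which matches the two resources that needed the down/up cycle.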

Config and logs from both machines are attached


-- 
Lukasz Engel

-------------- next part --------------
#
# Comment lines.
#

# global {

	# this is for people who set up a drbd device via the
	# loopback network interface or between two VMs on the same
	# box, for testing/simulating/presentation
	# otherwise it could trigger a run_task_queue deadlock.
#	disable_io_hints
# }

#
# this need not be drbd#, you may use phony resource names,
# like "resource web" or "resource mail", too
#


resource news {

  protocol C;
  incon-degr-cmd "halt -f";

  startup {
    wfc-timeout 1800;
    degr-wfc-timeout 120;
  }

  net {
    timeout     60;
    connect-int 10;
    ping-int    10;
  }
  
  syncer {
    group 0;
    rate 40960k;
  }

#  disk {
#    on-io-error 
#  }

  on gauss1.softax.local {
    device  /dev/drbd0;
    disk    /dev/evms/lvm/vgmirror/news;
    address 192.168.5.2:7780;
    meta-disk /dev/evms/lvm/vgmirror/drbd_news_meta[0];
  }

  on gauss2.softax.local {
    device  /dev/drbd0;
    disk    /dev/evms/lvm/vgmirror/news;
    address 192.168.5.3:7780;
    meta-disk /dev/evms/lvm/vgmirror/drbd_news_meta[0];
  }
}


resource www {

  protocol C;
  incon-degr-cmd "halt -f";

  startup {
    wfc-timeout 1800;
    degr-wfc-timeout 120;
  }

  net {
    timeout     60;
    connect-int 10;
    ping-int    10;
  }
  
  syncer {
    group 1;
    rate 40960k;
  }

#  disk {
#    on-io-error 
#  }

  on gauss1.softax.local {
    device  /dev/drbd1;
    disk    /dev/evms/lvm/vgmirror/www;
    address 192.168.5.2:7781;
    meta-disk /dev/evms/lvm/vgmirror/drbd_www_meta[0];
  }

  on gauss2.softax.local {
    device  /dev/drbd1;
    disk    /dev/evms/lvm/vgmirror/www;
    address 192.168.5.3:7781;
    meta-disk /dev/evms/lvm/vgmirror/drbd_www_meta[0];
  }
}


resource cvs {

  protocol C;
  incon-degr-cmd "halt -f";

  startup {
    wfc-timeout 1800;
    degr-wfc-timeout 120;
  }

  net {
    timeout     60;
    connect-int 10;
    ping-int    10;
  }
  
  syncer {
    group 2;
    rate 40960k;
  }

#  disk {
#    on-io-error 
#  }

  on gauss1.softax.local {
    device  /dev/drbd2;
    disk    /dev/evms/lvm/vgmirror/cvs;
    address 192.168.5.2:7782;
    meta-disk /dev/evms/lvm/vgmirror/drbd_cvs_meta[0];
  }

  on gauss2.softax.local {
    device  /dev/drbd2;
    disk    /dev/evms/lvm/vgmirror/cvs;
    address 192.168.5.3:7782;
    meta-disk /dev/evms/lvm/vgmirror/drbd_cvs_meta[0];
  }
}

resource dbdata {

  protocol C;
  incon-degr-cmd "halt -f";

  startup {
    wfc-timeout 1800;
    degr-wfc-timeout 120;
  }

  net {
    timeout     60;
    connect-int 10;
    ping-int    10;
  }
  
  syncer {
    group 3;
    rate 40960k;
  }

#  disk {
#    on-io-error 
#  }

  on gauss1.softax.local {
    device  /dev/drbd3;
    disk    /dev/evms/lvm/vgmirror/dbdata;
    address 192.168.5.2:7783;
    meta-disk /dev/evms/lvm/vgmirror/drbd_dbdata_meta[0];
  }

  on gauss2.softax.local {
    device  /dev/drbd3;
    disk    /dev/evms/lvm/vgmirror/dbdata;
    address 192.168.5.3:7783;
    meta-disk /dev/evms/lvm/vgmirror/drbd_dbdata_meta[0];
  }
}

resource ldap {

  protocol C;
  incon-degr-cmd "halt -f";

  startup {
    wfc-timeout 1800;
    degr-wfc-timeout 120;
  }

  net {
    timeout     60;
    connect-int 10;
    ping-int    10;
  }
  
  syncer {
    group 4;
    rate 40960k;
  }

#  disk {
#    on-io-error 
#  }

  on gauss1.softax.local {
    device  /dev/drbd4;
    disk    /dev/evms/lvm/vgmirror/ldap;
    address 192.168.5.2:7784;
    meta-disk /dev/evms/lvm/vgmirror/drbd_ldap_meta[0];
  }

  on gauss2.softax.local {
    device  /dev/drbd4;
    disk    /dev/evms/lvm/vgmirror/ldap;
    address 192.168.5.3:7784;
    meta-disk /dev/evms/lvm/vgmirror/drbd_ldap_meta[0];
  }
}



-------------- next part --------------
A non-text attachment was scrubbed...
Name: gauss1.log.gz
Type: application/gzip
Size: 3027 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-user/attachments/20070424/2942b630/gauss1.log.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: gauss2.log.gz
Type: application/gzip
Size: 2645 bytes
Desc: not available
Url : http://lists.linbit.com/pipermail/drbd-user/attachments/20070424/2942b630/gauss2.log.bin

