Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I ran "/etc/init.d/drbd start" on both storage0 and storage1 (they are debian systems). I then ran the following commands on storage1 "drbdsetup /dev/drbd0 primary --do-what-i-say" "drbdsetup /dev/drbd1 primary --do-what-i-say" "drbdsetup /dev/drbd2 primary --do-what-i-say" "drbdsetup /dev/drbd3 primary --do-what-i-say" "drbdsetup /dev/drbd4 primary --do-what-i-say" "drbdsetup /dev/drbd5 primary --do-what-i-say" I then got the reported errors on storage0. The drbd conf is as follows: # # drbd.conf example # # parameters you _need_ to change are the hostname, device, disk, # meta-disk, address and port in the "on <hostname> {}" sections. # # you ought to know about the protocol, and the various timeouts. # # you probably want to set the rate in the syncer sections # # increase timeout and maybe ping-int in net{}, if you see # problems with "connection lost/connection established" # (or change your setup to reduce network latency; make sure full # duplex behaves as such; check average roundtrip times while # network is saturated; and so on ...) # # # At most ONE global section is allowed. # It must precede any resource section. # global { # use this if you want to define more resources later # without reloading the module. # by default we load the module with exactly as many devices # as configured mentioned in this file. # # minor-count 5; # The user dialog counts and displays the seconds it waited so # far. You might want to disable this if you have the console # of your server connected to a serial terminal server with # limited logging capacity. # The Dialog will print the count each 'dialog-refresh' seconds, # set it to 0 to disable redrawing completely. [ default = 1 ] # # dialog-refresh 5; # 5 seconds # this is for people who set up a drbd device via the # loopback network interface or between two VMs on the same # box, for testing/simulating/presentation # otherwise it could trigger a run_tasq_queue deadlock. # I'm not sure whether this deadlock can happen with two # nodes, but it seems at least extremely unlikely; and since # the io_hints boost performance, keep them enabled. # # With linux 2.6 it no longer makes sense. # So this option should vanish. --lge # # disable-io-hints; } resource home-log { protocol C; # what should be done in case the cluster starts up in # degraded mode, but knows it has inconsistent data. incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f"; startup { # Wait for connection timeout. # The init script blocks the boot process until the resources # are connected. # In case you want to limit the wait time, do it here. # # wfc-timeout 0; # Wait for connection timeout if this node was a degraded cluster. # In case a degraded cluster (= cluster with only one node left) # is rebooted, this timeout value is used. # degr-wfc-timeout 120; # 2 minutes. } disk { # if the lower level device reports io-error you have the choice of # "pass_on" -> Report the io-error to the upper layers. # Primary -> report it to the mounted file system. # Secondary -> ignore it. # "panic" -> The node leaves the cluster by doing a kernel panic. # "detach" -> The node drops its backing storage device, and # continues in disk less mode. # on-io-error detach; } net { # this is the size of the tcp socket send buffer # increase it _carefully_ if you want to use protocol A over a # high latency network with reasonable write throughput. # defaults to 2*65535; you might try even 1M, but if your kernel or # network driver chokes on that, you have been warned. 
The drbd.conf is as follows (the stock example comments are identical in every
resource, so they are kept only for the first one, home-log):

#
# drbd.conf example
#
# parameters you _need_ to change are the hostname, device, disk,
# meta-disk, address and port in the "on <hostname> {}" sections.
#
# you ought to know about the protocol, and the various timeouts.
#
# you probably want to set the rate in the syncer sections
#
# increase timeout and maybe ping-int in net{}, if you see
# problems with "connection lost/connection established"
# (or change your setup to reduce network latency; make sure full
# duplex behaves as such; check average roundtrip times while
# network is saturated; and so on ...)
#

#
# At most ONE global section is allowed.
# It must precede any resource section.
#
global {
  # use this if you want to define more resources later
  # without reloading the module.
  # by default we load the module with exactly as many devices
  # as configured in this file.
  #
  # minor-count 5;

  # The user dialog counts and displays the seconds it waited so
  # far. You might want to disable this if you have the console
  # of your server connected to a serial terminal server with
  # limited logging capacity.
  # The dialog will print the count each 'dialog-refresh' seconds,
  # set it to 0 to disable redrawing completely. [ default = 1 ]
  #
  # dialog-refresh 5; # 5 seconds

  # this is for people who set up a drbd device via the
  # loopback network interface or between two VMs on the same
  # box, for testing/simulating/presentation;
  # otherwise it could trigger a run_task_queue deadlock.
  # I'm not sure whether this deadlock can happen with two
  # nodes, but it seems at least extremely unlikely; and since
  # the io-hints boost performance, keep them enabled.
  #
  # With Linux 2.6 it no longer makes sense,
  # so this option should vanish. --lge
  #
  # disable-io-hints;
}

resource home-log {
  protocol C;

  # what should be done in case the cluster starts up in
  # degraded mode, but knows it has inconsistent data.
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # Wait for connection timeout.
    # The init script blocks the boot process until the resources
    # are connected.
    # In case you want to limit the wait time, do it here.
    #
    # wfc-timeout 0;

    # Wait for connection timeout if this node was a degraded cluster.
    # In case a degraded cluster (= cluster with only one node left)
    # is rebooted, this timeout value is used.
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # if the lower level device reports an io-error you have the choice of
    # "pass_on" -> Report the io-error to the upper layers.
    #              Primary   -> report it to the mounted file system.
    #              Secondary -> ignore it.
    # "panic"   -> The node leaves the cluster by doing a kernel panic.
    # "detach"  -> The node drops its backing storage device, and
    #              continues in diskless mode.
    # on-io-error detach;
  }

  net {
    # this is the size of the tcp socket send buffer
    # increase it _carefully_ if you want to use protocol A over a
    # high latency network with reasonable write throughput.
    # defaults to 2*65535; you might try even 1M, but if your kernel or
    # network driver chokes on that, you have been warned.
    # sndbuf-size 512k;

    # timeout 60;     #  6 seconds (unit = 0.1 seconds)
    # connect-int 10; # 10 seconds (unit = 1 second)
    # ping-int 10;    # 10 seconds (unit = 1 second)

    # Maximal number of requests (4K) to be allocated by DRBD.
    # The minimum is hardcoded to 32 (= 128 KB).
    # For high performance installations it might help if you
    # increase that number. These buffers are used to hold
    # data blocks while they are written to disk.
    # max-buffers 2048;

    # The highest number of data blocks between two write barriers.
    # If you set this < 10 you might decrease your performance.
    # max-epoch-size 2048;

    # if some block send times out this many times, the peer is
    # considered dead, even if it still answers ping requests.
    # ko-count 4;

    # if the connection to the peer is lost you have the choice of
    # "reconnect"   -> Try to reconnect (AKA WFConnection state)
    # "stand_alone" -> Do not reconnect (AKA StandAlone state)
    # "freeze_io"   -> Try to reconnect but freeze all IO until
    #                  the connection is established again.
    # on-disconnect reconnect;
  }

  syncer {
    # Limit the bandwidth used by the resynchronisation process.
    # default unit is KB/sec; optional suffixes K,M,G are allowed
    # rate 10M;

    # All devices in one group are resynchronised in parallel.
    # Resynchronisation of groups is serialised in ascending order.
    # Put DRBD resources which are on different physical disks in one group.
    # Put DRBD resources on one physical disk in different groups.
    # group 1;

    # Configures the size of the active set. Each extent is 4M,
    # 257 extents ~> 1GB active set size. In case your syncer
    # runs @ 10MB/sec, all resync after a primary's crash will last
    # 1GB / (10MB/sec) ~ 102 seconds ~ one minute and 42 seconds.
    # BTW, the hash algorithm works best if the number of al-extents
    # is prime. (To test the worst-case performance use a power of 2.)
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.16.101.27:7701;
    meta-disk /dev/sda7[0];
  }

  on storage0 {
    device    /dev/drbd0;
    disk      /dev/sda5;
    address   10.16.101.16:7701;
    meta-disk /dev/sda7[0];
  }
}

resource mail-log {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # on-io-error detach;
  }

  net {
    # sndbuf-size 512k;
    # timeout 60;
    # connect-int 10;
    # ping-int 10;
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    # rate 10M;
    # group 1;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd1;
    disk      /dev/sda6;
    address   10.16.101.27:7702;
    meta-disk /dev/sda7[1];
  }

  on storage0 {
    device    /dev/drbd1;
    disk      /dev/sda6;
    address   10.16.101.16:7702;
    meta-disk /dev/sda7[1];
  }
}

resource etc {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # on-io-error detach;
  }

  net {
    # sndbuf-size 512k;
    # timeout 60;
    # connect-int 10;
    # ping-int 10;
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    # rate 10M;
    # group 1;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd2;
    disk      /dev/sdb1;
    address   10.16.101.27:7703;
    meta-disk /dev/sda7[2];
  }

  on storage0 {
    device    /dev/drbd2;
    disk      /dev/sdb1;
    address   10.16.101.16:7703;
    meta-disk /dev/sda7[2];
  }
}

resource scripts {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # on-io-error detach;
  }

  net {
    # sndbuf-size 512k;
    # timeout 60;
    # connect-int 10;
    # ping-int 10;
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    # rate 10M;
    # group 1;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd3;
    disk      /dev/sdb2;
    address   10.16.101.27:7704;
    meta-disk /dev/sda7[3];
  }

  on storage0 {
    device    /dev/drbd3;
    disk      /dev/sdb2;
    address   10.16.101.16:7704;
    meta-disk /dev/sda7[3];
  }
}

resource home {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # on-io-error detach;
  }

  net {
    # sndbuf-size 512k;
    # timeout 60;
    # connect-int 10;
    # ping-int 10;
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    # rate 10M;
    # group 1;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd4;
    disk      /dev/sdb3;
    address   10.16.101.27:7705;
    meta-disk /dev/sda7[4];
  }

  on storage0 {
    device    /dev/drbd4;
    disk      /dev/sdb3;
    address   10.16.101.16:7705;
    meta-disk /dev/sda7[4];
  }
}

resource mail {
  protocol C;
  incon-degr-cmd "echo '!DRBD! pri on incon-degr' | wall ; sleep 60 ; halt -f";

  startup {
    # wfc-timeout 0;
    # degr-wfc-timeout 120; # 2 minutes.
  }

  disk {
    # on-io-error detach;
  }

  net {
    # sndbuf-size 512k;
    # timeout 60;
    # connect-int 10;
    # ping-int 10;
    # max-buffers 2048;
    # max-epoch-size 2048;
    # ko-count 4;
    # on-disconnect reconnect;
  }

  syncer {
    # rate 10M;
    # group 1;
    al-extents 257;
  }

  on storage1 {
    device    /dev/drbd5;
    disk      /dev/sdc1;
    address   10.16.101.27:7706;
    meta-disk /dev/sda7[5];
  }

  on storage0 {
    device    /dev/drbd5;
    disk      /dev/sdc1;
    address   10.16.101.16:7706;
    meta-disk /dev/sda7[5];
  }
}
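(Since every resource is defined in this drbd.conf, the same promotion could
also be driven through drbdadm, which reads the config instead of addressing
each device node. A sketch, assuming the 0.7 syntax for passing drbdsetup
options through drbdadm is remembered correctly:

    # Each device should report cs:Connected (and ld:Consistent)
    # in /proc/drbd before it is forced primary.
    cat /proc/drbd

    # Promote every resource configured in drbd.conf on this node;
    # the "--" separator passes --do-what-I-say down to drbdsetup.
    drbdadm -- --do-what-I-say primary all
)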
Philipp Reisner wrote:

> On Tuesday 16 November 2004 09:20, you wrote:
> > Ahh, drbd version 0.7.5 and kernel 2.6.9
> >
> > Steven
> >
> > Philipp Reisner wrote:
> > > [...]
> > >
> > > > I noticed that there were a few threads related to the sock_sendmsg
> > > > time expired error, but did not see the second error in the archive.
> > > >
> > > > Is there enough information above to debug the problem?
> > >
> > > Hmmm, first of all I would like to know which version of DRBD you used?
> > > And which kernel version?
> > >
> > > -philipp
>
> Hi,
>
> Hmm. What exactly did you do?
>
> Please post all the steps you took.
> "drbdsetup /dev/drbd5 primary --do-what-I-say" was the last command you
> issued before it OOPSed; what did you do before that?
>
> Please post all the commands you issued on both nodes.
>
> -phil
> --
> : Dipl-Ing Philipp Reisner                   Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH       Fax +43-1-8178292-82 :
> : Schönbrunnerstr 244, 1120 Vienna, Austria  http://www.linbit.com :