[DRBD-user] Bad network connection causing DRBD to freeze

Rainer Sabelka sabelka at iue.tuwien.ac.at
Mon Feb 2 13:23:51 CET 2009



Hi,

I'm using DRBD (0.8.12) on a pair of servers in separate locations connected 
by an (almost) dedicated 1 Gbit Ethernet link.
This connection has become unreliable: from time to time we see packet loss 
of up to 30 percent.
During these phases of high packet loss, access to the DRBD device blocks 
for several minutes and the applications accessing the disk become completely 
unresponsive.

While we are trying to fix the network connection in the first place, I wonder 
if I can do something with DRBD to work around this problem.

From what I see in the log files, it seems that DRBD detects the network 
failure, disconnects, and immediately tries to reconnect. Then it stays in the 
WFBitMapS state for several minutes.
It seems that any access to the DRBD device during this time blocks until the 
SyncSource state is reached.
If the packet loss on the network continues for a longer period, this 
disconnect-reconnect cycle repeats several times.
The result is that a disturbance in the network connection between the servers 
basically suspends all running services which depend on DRBD.

To work around the problem, I've now put DRBD into standalone mode.
Is there anything else I can do about this?

-Rainer

---

PS: syslog output and drbd.conf:

on server2 (primary):

Jan 30 11:21:33 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 19
Jan 30 11:21:39 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 18
Jan 30 11:21:45 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 17
Jan 30 11:21:51 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 16
Jan 30 11:21:57 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 15
Jan 30 11:22:03 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 14
Jan 30 11:22:09 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 13
Jan 30 11:22:09 server2 kernel: drbd0: PingAck did not arrive in time.
Jan 30 11:22:09 server2 kernel: drbd0: peer( Secondary -> Unknown ) conn( 
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 30 11:22:09 server2 kernel: drbd0: asender terminated
Jan 30 11:22:09 server2 kernel: drbd0: Terminating asender thread
Jan 30 11:22:09 server2 kernel: drbd0: short read expecting header on sock: 
r=-512
Jan 30 11:22:15 server2 kernel: drbd0: md_sync_timer expired! Worker calls 
drbd_md_sync().
Jan 30 11:22:15 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server2 kernel: drbd0: Creating new current UUID
Jan 30 11:22:15 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server2 kernel: drbd0: tl_clear()
Jan 30 11:22:15 server2 kernel: drbd0: Connection closed
Jan 30 11:22:15 server2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server2 kernel: drbd0: receiver terminated
Jan 30 11:22:15 server2 kernel: drbd0: receiver (re)started
Jan 30 11:22:15 server2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server2 kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 86
Jan 30 11:22:18 server2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server2 kernel: drbd0: Starting asender thread (from 
drbd0_receiver [10567])
Jan 30 11:22:18 server2 kernel: drbd0: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jan 30 11:22:18 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:36 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 19
Jan 30 11:22:42 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 18
Jan 30 11:22:48 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 17
Jan 30 11:22:54 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 16
Jan 30 11:23:00 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 15
Jan 30 11:23:06 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 14
Jan 30 11:23:12 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 13
Jan 30 11:23:18 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 12
Jan 30 11:23:24 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 11
Jan 30 11:23:30 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 10
Jan 30 11:23:36 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 9
Jan 30 11:23:42 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 8
Jan 30 11:23:48 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 7
Jan 30 11:23:54 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 6
Jan 30 11:24:00 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 5
Jan 30 11:24:06 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 4
Jan 30 11:24:12 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 3
Jan 30 11:24:24 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 2
Jan 30 11:24:30 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 1
Jan 30 11:24:46 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 19
Jan 30 11:25:10 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 19
Jan 30 11:25:16 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 18
Jan 30 11:25:22 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 17
Jan 30 11:25:28 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 16
Jan 30 11:25:34 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 15
Jan 30 11:25:40 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 14
Jan 30 11:25:46 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 13
Jan 30 11:25:52 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 12
Jan 30 11:25:58 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 11
Jan 30 11:26:04 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 10
Jan 30 11:26:10 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 9
Jan 30 11:26:16 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 8
Jan 30 11:26:22 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 7
Jan 30 11:26:28 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 6
Jan 30 11:26:34 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 5
Jan 30 11:26:40 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time 
expired, ko = 4
Jan 30 11:27:03 server2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( 
UpToDate -> Inconsistent )
Jan 30 11:27:03 server2 kernel: drbd0: Began resync as SyncSource (will sync 
3468 KB [867 bits set]).
Jan 30 11:27:03 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:06 server2 kernel: drbd0: Resync done (total 3 sec; paused 0 sec; 
1156 K/sec)
Jan 30 11:27:06 server2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( 
Inconsistent -> UpToDate )
Jan 30 11:27:06 server2 kernel: drbd0: Writing meta data super block now.
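
The time spent in each connection state can be read off the conn( ... ) 
transitions in the log above. Below is a minimal sketch in Python, assuming 
the mail client's line wraps have been undone so that each log entry sits on 
one line. The names LOG, CONN_RE and state_durations are illustrative and not 
part of DRBD, and the year is assumed because syslog timestamps do not carry 
one.

```python
import re
from datetime import datetime

# Excerpt of the server2 log above, with the line wraps undone.
LOG = """\
Jan 30 11:22:09 server2 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 30 11:22:15 server2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Jan 30 11:27:03 server2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Jan 30 11:27:06 server2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
"""

# Captures the syslog timestamp plus the old and new connection state.
CONN_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*?\bconn\( (\w+) -> (\w+) \)"
)


def state_durations(log_text, year=2009):
    """Return (state, seconds) for every conn state that was entered and left."""
    durations = []
    entered = {}  # state name -> time at which it was entered
    for line in log_text.splitlines():
        match = CONN_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        old_state, new_state = match.group(2), match.group(3)
        if old_state in entered:
            elapsed = stamp - entered.pop(old_state)
            durations.append((old_state, int(elapsed.total_seconds())))
        entered[new_state] = stamp
    return durations


if __name__ == "__main__":
    for state, seconds in state_durations(LOG):
        print(f"{state:15s} {seconds:4d} s")
```

For this excerpt the WFBitMapS entry comes out at 285 seconds (11:22:18 to 
11:27:03), which matches the several minutes during which access to the 
device blocked.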


on server1 (secondary):

Jan 30 11:22:15 server1 kernel: drbd0: sock_recvmsg returned -104
Jan 30 11:22:15 server1 kernel: drbd0: peer( Primary -> Unknown ) conn( 
Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 30 11:22:15 server1 kernel: drbd0: asender terminated
Jan 30 11:22:15 server1 kernel: drbd0: Terminating asender thread
Jan 30 11:22:15 server1 kernel: drbd0: short read receiving data: read 2280 
expected 4096
Jan 30 11:22:15 server1 kernel: drbd0: error receiving Data, l: 4120!
Jan 30 11:22:15 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server1 kernel: drbd0: tl_clear()
Jan 30 11:22:15 server1 kernel: drbd0: Connection closed
Jan 30 11:22:15 server1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server1 kernel: drbd0: receiver terminated
Jan 30 11:22:15 server1 kernel: drbd0: receiver (re)started
Jan 30 11:22:15 server1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server1 kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 86
Jan 30 11:22:18 server1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server1 kernel: drbd0: Starting asender thread (from 
drbd0_receiver [10883])
Jan 30 11:22:21 server1 kernel: drbd0: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jan 30 11:22:21 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:03 server1 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
Jan 30 11:27:03 server1 kernel: drbd0: conn( WFSyncUUID -> SyncTarget ) disk( 
UpToDate -> Inconsistent )
Jan 30 11:27:03 server1 kernel: drbd0: Began resync as SyncTarget (will sync 
3468 KB [867 bits set]).
Jan 30 11:27:03 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:06 server1 kernel: drbd0: Resync done (total 3 sec; paused 0 sec; 
1156 K/sec)
Jan 30 11:27:06 server1 kernel: drbd0: conn( SyncTarget -> Connected ) disk( 
Inconsistent -> UpToDate )
Jan 30 11:27:06 server1 kernel: drbd0: Writing meta data super block now.


----


# cat /etc/drbd.conf
global {                                 
    usage-count no;                      
}                                        

common {
  syncer { rate 100M; }
}                      

resource drbd_data {
  protocol C;       

  startup {
    # wfc-timeout  600;
    degr-wfc-timeout 120;
  }                      

  disk {
    on-io-error   detach;
  }                      

  net {
    after-sb-0pri disconnect;
    after-sb-1pri disconnect;
    after-sb-2pri disconnect;
    rr-conflict disconnect;
    ko-count 20;
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on server1 {
    device     /dev/drbd0;
    disk       /dev/sda7;
    address    10.43.101.111:7788;
    meta-disk  internal;
  }

  on server2 {
    device     /dev/drbd0;
    disk       /dev/sda7;
    address    10.43.101.112:7788;
    meta-disk  internal;
  }
}
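
As a rough worked example of what "ko-count 20" in the net section above can 
mean for a blocked send: in the server2 log the "sock_sendmsg time expired" 
messages arrive about 6 seconds apart and the ko value drops by one each 
time. Treating that 6-second spacing as the effective send timeout is an 
assumption taken from the log, not from the configuration, and the function 
name below is purely illustrative.

```python
def worst_case_send_stall(ko_count, expiry_interval_s):
    """Seconds one blocked send can last before the ko counter reaches zero.

    Assumes the counter starts at ko_count and drops by one per expiry,
    as the 'ko = 19, 18, 17, ...' lines in the log suggest.
    """
    return ko_count * expiry_interval_s


if __name__ == "__main__":
    # 6 s is the spacing observed between log messages (an assumption).
    for ko_count in (20, 10, 5):
        stall = worst_case_send_stall(ko_count, 6)
        print(f"ko-count {ko_count:2d}: up to {stall} s per blocked send")
```

With the values above this gives up to 120 seconds for ko-count 20. In the 
log the counter was also reset to 19 before reaching zero, so the total stall 
can be longer than a single countdown.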



