Hi,

I'm using DRBD (0.8.12) on a pair of servers in separate locations connected by an (almost) dedicated 1 Gbit Ethernet link. This connection has become unreliable: from time to time we see packet loss of up to 30 percent. During these phases of high packet loss, access to the DRBD device blocks for several minutes and the applications accessing the disk become completely unresponsive.

While we are trying to fix the network connection in the first place, I wonder if I can do something with DRBD to work around this problem. From what I see in the logfiles, it seems that DRBD detects the network failure, disconnects, and immediately tries to reconnect. Then it stays in the WFBitMapS state for several minutes. It seems that any access to the DRBD device during this time blocks until the SyncSource state is reached. If the packet loss on the network continues for a longer period, this disconnect-reconnect cycle repeats several times.

The result is that a disturbance in the network connection between the servers basically suspends all running services which depend on DRBD. To work around the problem I've now put DRBD into standalone mode. Is there anything else I can do about this?
-Rainer

---
PS: syslog output and drbd.conf:

on server2 (primary):

Jan 30 11:21:33 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 19
Jan 30 11:21:39 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 18
Jan 30 11:21:45 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 17
Jan 30 11:21:51 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 16
Jan 30 11:21:57 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 15
Jan 30 11:22:03 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 14
Jan 30 11:22:09 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 13
Jan 30 11:22:09 server2 kernel: drbd0: PingAck did not arrive in time.
Jan 30 11:22:09 server2 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk ( UpToDate -> DUnknown )
Jan 30 11:22:09 server2 kernel: drbd0: asender terminated
Jan 30 11:22:09 server2 kernel: drbd0: Terminating asender thread
Jan 30 11:22:09 server2 kernel: drbd0: short read expecting header on sock: r=-512
Jan 30 11:22:15 server2 kernel: drbd0: md_sync_timer expired! Worker calls drbd_md_sync().
Jan 30 11:22:15 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server2 kernel: drbd0: Creating new current UUID
Jan 30 11:22:15 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server2 kernel: drbd0: tl_clear()
Jan 30 11:22:15 server2 kernel: drbd0: Connection closed
Jan 30 11:22:15 server2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server2 kernel: drbd0: receiver terminated
Jan 30 11:22:15 server2 kernel: drbd0: receiver (re)started
Jan 30 11:22:15 server2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server2 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jan 30 11:22:18 server2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server2 kernel: drbd0: Starting asender thread (from drbd0_receiver [10567])
Jan 30 11:22:18 server2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk ( DUnknown -> UpToDate )
Jan 30 11:22:18 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:36 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 19
Jan 30 11:22:42 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 18
Jan 30 11:22:48 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 17
Jan 30 11:22:54 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 16
Jan 30 11:23:00 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 15
Jan 30 11:23:06 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 14
Jan 30 11:23:12 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 13
Jan 30 11:23:18 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 12
Jan 30 11:23:24 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 11
Jan 30 11:23:30 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 10
Jan 30 11:23:36 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 9
Jan 30 11:23:42 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 8
Jan 30 11:23:48 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 7
Jan 30 11:23:54 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 6
Jan 30 11:24:00 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 5
Jan 30 11:24:06 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 4
Jan 30 11:24:12 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 3
Jan 30 11:24:24 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 2
Jan 30 11:24:30 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 1
Jan 30 11:24:46 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 19
Jan 30 11:25:10 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 19
Jan 30 11:25:16 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 18
Jan 30 11:25:22 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 17
Jan 30 11:25:28 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 16
Jan 30 11:25:34 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 15
Jan 30 11:25:40 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 14
Jan 30 11:25:46 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 13
Jan 30 11:25:52 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 12
Jan 30 11:25:58 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 11
Jan 30 11:26:04 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 10
Jan 30 11:26:10 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 9
Jan 30 11:26:16 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 8
Jan 30 11:26:22 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 7
Jan 30 11:26:28 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 6
Jan 30 11:26:34 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 5
Jan 30 11:26:40 server2 kernel: drbd0: [drbd0_worker/13962] sock_sendmsg time expired, ko = 4
Jan 30 11:27:03 server2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Jan 30 11:27:03 server2 kernel: drbd0: Began resync as SyncSource (will sync 3468 KB [867 bits set]).
Jan 30 11:27:03 server2 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:06 server2 kernel: drbd0: Resync done (total 3 sec; paused 0 sec; 1156 K/sec)
Jan 30 11:27:06 server2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Jan 30 11:27:06 server2 kernel: drbd0: Writing meta data super block now.

on server1 (secondary):

Jan 30 11:22:15 server1 kernel: drbd0: sock_recvmsg returned -104
Jan 30 11:22:15 server1 kernel: drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Jan 30 11:22:15 server1 kernel: drbd0: asender terminated
Jan 30 11:22:15 server1 kernel: drbd0: Terminating asender thread
Jan 30 11:22:15 server1 kernel: drbd0: short read receiving data: read 2280 expected 4096
Jan 30 11:22:15 server1 kernel: drbd0: error receiving Data, l: 4120!
Jan 30 11:22:15 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:22:15 server1 kernel: drbd0: tl_clear()
Jan 30 11:22:15 server1 kernel: drbd0: Connection closed
Jan 30 11:22:15 server1 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server1 kernel: drbd0: receiver terminated
Jan 30 11:22:15 server1 kernel: drbd0: receiver (re)started
Jan 30 11:22:15 server1 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server1 kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
Jan 30 11:22:18 server1 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server1 kernel: drbd0: Starting asender thread (from drbd0_receiver [10883])
Jan 30 11:22:21 server1 kernel: drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Jan 30 11:22:21 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:03 server1 kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
Jan 30 11:27:03 server1 kernel: drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Jan 30 11:27:03 server1 kernel: drbd0: Began resync as SyncTarget (will sync 3468 KB [867 bits set]).
Jan 30 11:27:03 server1 kernel: drbd0: Writing meta data super block now.
Jan 30 11:27:06 server1 kernel: drbd0: Resync done (total 3 sec; paused 0 sec; 1156 K/sec)
Jan 30 11:27:06 server1 kernel: drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Jan 30 11:27:06 server1 kernel: drbd0: Writing meta data super block now.
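
To put a number on "several minutes", the time spent in each connection state can be read off the conn( A -> B ) transitions in the server2 log above. The following is only a rough sketch: the sample lines are copied from that log, the function name state_durations is made up for illustration, and the year is an arbitrary placeholder because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# conn() transition lines copied from the server2 syslog output above.
LOG = """\
Jan 30 11:22:09 server2 kernel: drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk ( UpToDate -> DUnknown )
Jan 30 11:22:15 server2 kernel: drbd0: conn( NetworkFailure -> Unconnected )
Jan 30 11:22:15 server2 kernel: drbd0: conn( Unconnected -> WFConnection )
Jan 30 11:22:18 server2 kernel: drbd0: conn( WFConnection -> WFReportParams )
Jan 30 11:22:18 server2 kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk ( DUnknown -> UpToDate )
Jan 30 11:27:03 server2 kernel: drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Jan 30 11:27:06 server2 kernel: drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
"""

# Captures the syslog timestamp plus the old and new connection state.
TRANSITION = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*?\bconn\( (\w+) -> (\w+) \)"
)


def state_durations(log_text, year=2000):
    """Return [(state, seconds)] for each connection state entered and left.

    `year` is a placeholder (assumption): syslog lines have no year.
    """
    events = []
    for line in log_text.splitlines():
        m = TRANSITION.search(line)
        if m:
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, m.group(2), m.group(3)))

    durations = []
    # Pair each transition with the next one; the state entered by the first
    # must be the state left by the second.
    for (t_in, _, entered), (t_out, left, _) in zip(events, events[1:]):
        if entered == left:
            durations.append((entered, int((t_out - t_in).total_seconds())))
    return durations


if __name__ == "__main__":
    for state, seconds in state_durations(LOG):
        print(f"{state:15s} {seconds:4d} s")
```

For the excerpt above this reports 285 seconds in WFBitMapS (11:22:18 to 11:27:03), against 3 seconds for the resync itself.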
----
# cat /etc/drbd.conf
global {
    usage-count no;
}

common {
    syncer {
        rate 100M;
    }
}

resource drbd_data {
    protocol C;

    startup {
        # wfc-timeout 600;
        degr-wfc-timeout 120;
    }

    disk {
        on-io-error detach;
    }

    net {
        after-sb-0pri disconnect;
        after-sb-1pri disconnect;
        after-sb-2pri disconnect;
        rr-conflict disconnect;
        ko-count 20;
    }

    syncer {
        rate 100M;
        al-extents 257;
    }

    on server1 {
        device /dev/drbd0;
        disk /dev/sda7;
        address 10.43.101.111:7788;
        meta-disk internal;
    }

    on server2 {
        device /dev/drbd0;
        disk /dev/sda7;
        address 10.43.101.112:7788;
        meta-disk internal;
    }
}
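
For reference, here is a back-of-the-envelope calculation of how long a single blocked send is tolerated with this configuration. It is a sketch under one assumption: no "timeout" is set in the net section, so the DRBD default of 60 (in tenths of a second) is taken to apply, which fits the 6-second spacing of the "sock_sendmsg time expired" messages in the log. The function name is made up for illustration.

```python
KO_COUNT = 20        # "ko-count 20;" from the net section above
# Assumption: the net section sets no "timeout", so the default of 60 applies.
TIMEOUT_TENTHS = 60  # unit: tenths of a second, i.e. 6 seconds


def max_blocked_send_seconds(ko_count, timeout_tenths):
    """Seconds until the ko counter, decremented once per timeout, hits zero."""
    return ko_count * timeout_tenths / 10


if __name__ == "__main__":
    print(max_blocked_send_seconds(KO_COUNT, TIMEOUT_TENTHS), "seconds")
```

With these values the result is 120 seconds. In the server2 log the counter is back at 19 at 11:24:46 and again at 11:25:10, so the countdown apparently restarted before reaching zero, which would be consistent with the stall lasting longer than that.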