[DRBD-user] FW: Frequent Disconnects Between Nodes

Brian E. Dunbar bdunbar at dunbarconsulting.org
Tue Aug 21 16:16:57 CEST 2007


I am having a problem with DRBD disconnecting and reconnecting frequently. I
am currently running version 8.0.5 (api:86/proto:86), SVN Revision: 3012
(and 0.7.22; see below).

When these disconnects occur, the load on the machine spikes to very high
numbers (sometimes more than 100). For background, the replication link is a
crossover cable between gigabit Ethernet cards. The disk that is synced
between the DRBD peers sees a high volume of writes. I think a hardware
problem is unlikely, as I am seeing this on both our test boxes and our
production boxes.

I am seeing similar problems on version 0.7.22 (api:79/proto:74), SVN
Revision: 2555M, which our production machines are running. I upgraded our
test machines to 8.0.5 to see whether that solved the problem, and it does
not seem to have.

At one point on the 8.0.5 box, I saw this in the log:
Aug 20 08:05:40 kernel: [44069281.930000] drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().

I have included snippets from /var/log/messages on the Primary and the
Secondary, covering the same time window in which a disconnect occurred.

Primary:
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: Creating new current UUID
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: asender terminated
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: sock was shut down by peer
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: tl_clear()
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: Connection closed
Aug 20 07:44:04 kernel: [44067986.030000] drbd0: Writing meta data super block now.
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: conn( NetworkFailure -> Unconnected )
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: receiver terminated
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: receiver (re)started
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: conn( Unconnected -> WFConnection )
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: conn( WFConnection -> WFReportParams )
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 20 07:44:04 kernel: [44067986.120000] drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( DUnknown -> UpToDate )
Aug 20 07:44:04 kernel: [44067986.170000] drbd0: Writing meta data super block now.
Aug 20 07:44:04 kernel: [44067986.210000] drbd0: conn( WFBitMapS -> SyncSource ) pdsk( UpToDate -> Inconsistent )
Aug 20 07:44:04 kernel: [44067986.210000] drbd0: Began resync as SyncSource (will sync 484 KB [121 bits set]).
Aug 20 07:44:04 kernel: [44067986.210000] drbd0: Writing meta data super block now.
Aug 20 07:44:04 kernel: [44067986.370000] drbd0: Resync done (total 1 sec; paused 0 sec; 484 K/sec)
Aug 20 07:44:04 kernel: [44067986.380000] drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
Aug 20 07:44:04 kernel: [44067986.380000] drbd0: Writing meta data super block now.
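
A quick sanity check on the resync size reported in the log above, as a minimal sketch. It assumes DRBD's bitmap granularity of one bit per 4 KiB block; the function name is only illustrative. With that assumption, the 121 set bits account for the 484 KB that was resynced, which suggests only a small amount of data went out of sync during the outage.

```python
# Assumption: each set bit in the DRBD bitmap marks one 4 KiB block as out of sync.
BLOCK_KIB_PER_BIT = 4


def resync_size_kib(bits_set):
    """Amount of data (in KiB) a resync has to transfer for `bits_set` dirty bits."""
    return bits_set * BLOCK_KIB_PER_BIT


print(resync_size_kib(121))  # 484, matching "will sync 484 KB [121 bits set]"
```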


Secondary:
Aug 20 07:44:01 kernel: [44067168.510000] drbd0: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Aug 20 07:44:01 kernel: [44067168.510000] drbd0: asender terminated
Aug 20 07:44:01 kernel: [44067168.510000] drbd0: tl_clear()
Aug 20 07:44:01 kernel: [44067168.510000] drbd0: Connection closed
Aug 20 07:44:01 kernel: [44067168.510000] drbd0: Writing meta data super block now.
Aug 20 07:44:01 kernel: [44067168.520000] drbd0: conn( NetworkFailure -> Unconnected )
Aug 20 07:44:01 kernel: [44067168.520000] drbd0: receiver terminated
Aug 20 07:44:01 kernel: [44067168.520000] drbd0: receiver (re)started
Aug 20 07:44:01 kernel: [44067168.520000] drbd0: conn( Unconnected -> WFConnection )
Aug 20 07:44:04 kernel: [44067171.670000] drbd0: conn( WFConnection -> WFReportParams )
Aug 20 07:44:04 kernel: [44067171.670000] drbd0: Handshake successful: DRBD Network Protocol version 86
Aug 20 07:44:04 kernel: [44067171.680000] drbd0: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Aug 20 07:44:04 kernel: [44067171.680000] drbd0: Writing meta data super block now.
Aug 20 07:44:04 kernel: [44067171.760000] drbd0: conn( WFBitMapT -> WFSyncUUID )
Aug 20 07:44:04 kernel: [44067171.770000] drbd0: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
Aug 20 07:44:04 kernel: [44067171.770000] drbd0: Began resync as SyncTarget (will sync 484 KB [121 bits set]).
Aug 20 07:44:04 kernel: [44067171.770000] drbd0: Writing meta data super block now.
Aug 20 07:44:04 kernel: [44067171.930000] drbd0: Resync done (total 1 sec; paused 0 sec; 484 K/sec)
Aug 20 07:44:04 kernel: [44067171.940000] drbd0: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
Aug 20 07:44:04 kernel: [44067171.940000] drbd0: Writing meta data super block now.
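
To put a number on "frequently", the disconnects can be counted straight from the logs. The following is a minimal sketch, not a DRBD tool: it assumes each kernel message sits on a single line in /var/log/messages and uses the 8.x "conn( Old -> New )" format shown above. The function names, regular expressions, and hourly bucketing are illustrative choices.

```python
import re
from collections import Counter

# One "conn( Old -> New )" state transition, as printed in the excerpts above.
CONN_RE = re.compile(r"conn\( (\w+) -> (\w+) \)")
# Leading syslog timestamp, e.g. "Aug 20 07:44:04".
STAMP_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)")


def find_disconnects(lines, device="drbd0"):
    """Return the syslog timestamps at which `device` left the Connected state."""
    stamps = []
    for line in lines:
        if f" {device}: " not in line:
            continue
        match = CONN_RE.search(line)
        if match and match.group(1) == "Connected":
            stamp = STAMP_RE.match(line)
            stamps.append(stamp.group(1) if stamp else "?")
    return stamps


def disconnects_per_hour(stamps):
    """Bucket timestamps by 'Mon DD HH' (drops the trailing ':MM:SS')."""
    return Counter(s[:-6] for s in stamps if s != "?")


# Three lines taken from the Primary excerpt, joined onto single lines.
sample = [
    "Aug 20 07:44:04 kernel: [44067986.030000] drbd0: peer( Secondary -> Unknown ) "
    "conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "Aug 20 07:44:04 kernel: [44067986.120000] drbd0: conn( NetworkFailure -> Unconnected )",
    "Aug 20 07:44:04 kernel: [44067986.380000] drbd0: conn( SyncSource -> Connected ) "
    "pdsk( Inconsistent -> UpToDate )",
]

events = find_disconnects(sample)
print(events)
print(disconnects_per_hour(events))
```

Against a real log, the same functions can be fed the open file object for /var/log/messages instead of `sample`. Running this on both nodes and comparing the hourly counts with the times of the load spikes should show whether the two line up.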


Any help is much appreciated,

Brian



