[DRBD-user] Lots of errors after 10-15 minutes

Marc Pope marc.pope at falconarea.net
Tue Oct 13 17:24:35 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

I am new to drbd, and I am just getting this set up. I am having a few  
problems. I am running 8.3 now on CentOS 5.3 64bit version. All latest  
patches applied.

I am getting errors, see way below..   I am just at the initial sync  
step.

nas1 is configured with:
	6 x 1TB  & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID 5 on  
the 6 drives

nas2 is configured with
	4 x  1.5TB & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID 5  
on the 4 drives

/data partition is /dev/sdb1  which is a total of 4,200,000 MB (about  
4TB)
/meta partition is /dev/sdb2

configuration: /etc/drbd.conf

uname -a:
	Linux nas2.mydomainhere.com 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30  
EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

config:

common {
	protocol C;
	syncer {
		rate          60M;
		al-extents    257;
		}
	}
resource r0 {
handlers {
	pri-on-incon-degr    	"halt -f";
	}
disk {
	on-io-error		detach;
	}
startup {
	degr-wfc-timeout	120;
	}
on nas1.mydomainhere.com {
	device		/dev/drbd0;
	disk			/dev/sdb1;
	address		XXX.XXX.137.40:7789;
	meta-disk	/dev/sdb2[0];
}
on nas2.mydomainhere.com {
	device		/dev/drbd0;
	disk        		/dev/sdb1;
	address		XXX.XXX.137.41:7789;
	meta-disk	/dev/sdb2[0];
}
}


When first starting up, it starts syncing for about 10-15 minutes...  
(you can see it's down to 3,513,200 left to sync)...

# cat /proc/drbd

version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild at v20z-x86-64.home.local 
, 2009-08-29 14:07:55
  0: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C  
r----
     ns:30511104 nr:0 dw:0 dr:30511104 al:0 bm:1861 lo:0 pe:94 ua:0 ap: 
0 ep:1 wo:b oos:3567008704
	[>....................] sync'ed:  0.9% (3483404/3513200)M
	finish: 13:08:27 speed: 75,376 (58,332) K/sec


Then, in /var/log/messages, these errors start appearing:

Oct 13 10:06:17 nas1 avahi-daemon[3793]: Invalid response packet.
Oct 13 10:06:17 nas1 last message repeated 9 times

As soon as that starts, then we get all kinds of errors like this  
(sorry for the long post, trying to be complete)...

Oct 13 10:01:18 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:E0256B5135655E7D:22CB9163CBBE953B:027F952A21588331  
bits:899379200 flags:0
Oct 13 10:01:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:01:18 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:01:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFBitMapT ->  
WFSyncUUID )
Oct 13 10:01:21 nas2 kernel: block drbd0: helper command: /sbin/ 
drbdadm before-resync-target minor-0
Oct 13 10:01:21 nas2 kernel: block drbd0: helper command: /sbin/ 
drbdadm before-resync-target minor-0 exit code 0 (0x0)
Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFSyncUUID ->  
SyncTarget )
Oct 13 10:01:21 nas2 kernel: block drbd0: Began resync as SyncTarget  
(will sync 3597516800 KB [899379200 bits set]).

-- start of problems here.... 

Oct 13 10:06:17 nas2 avahi-daemon[3971]: Invalid response packet.
Oct 13 10:06:17 nas2 last message repeated 9 times
Oct 13 10:10:41 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:10:41 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:10:41 nas2 kernel: block drbd0: asender terminated
Oct 13 10:10:41 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:10:41 nas2 kernel: block drbd0: short read receiving data:  
read 3720 expected 4096
Oct 13 10:10:41 nas2 kernel: block drbd0: error receiving RSDataReply,  
l: 32792!
Oct 13 10:10:41 nas2 kernel: block drbd0: Connection closed
Oct 13 10:10:41 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:10:41 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:10:41 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:10:41 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:10:41 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:11:37 nas2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Oct 13 10:11:37 nas2 kernel: r8169: eth0: link up
Oct 13 10:11:39 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:11:39 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:11:39 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:11:39 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:11:39 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:11:39 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:11:39 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:11:39 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:11:39 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:11:39 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:11:49 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:11:49 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:11:49 nas2 kernel: block drbd0: asender terminated
Oct 13 10:11:49 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:11:49 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:11:49 nas2 kernel: block drbd0: Connection closed
Oct 13 10:11:49 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:11:49 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:11:49 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:11:49 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:11:49 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:12:18 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:12:18 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:12:18 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:12:18 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:12:18 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:12:18 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:12:18 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:12:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:12:18 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:12:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:12:29 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:12:29 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:12:29 nas2 kernel: block drbd0: asender terminated
Oct 13 10:12:29 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:12:29 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:12:29 nas2 kernel: block drbd0: Connection closed
Oct 13 10:12:29 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:12:29 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:12:29 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:12:29 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:12:29 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:12:54 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:12:54 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:12:54 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:12:54 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:12:54 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:12:54 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:12:54 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:12:54 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:12:54 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:12:54 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:13:05 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:13:05 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:13:05 nas2 kernel: block drbd0: asender terminated
Oct 13 10:13:05 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:13:05 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:13:05 nas2 kernel: block drbd0: Connection closed
Oct 13 10:13:05 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:13:05 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:13:05 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:13:05 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:13:05 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:19:07 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:19:07 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:19:07 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:19:07 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:19:37 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:19:37 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:19:37 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:19:37 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:19:37 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:19:37 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:19:37 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:19:37 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:19:37 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:19:37 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:19:47 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:19:47 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:19:47 nas2 kernel: block drbd0: asender terminated
Oct 13 10:19:47 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:19:47 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:19:47 nas2 kernel: block drbd0: Connection closed
Oct 13 10:19:47 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:19:47 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:19:47 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:19:47 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:19:47 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:20:17 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:20:17 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:20:17 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:20:17 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:20:17 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:20:17 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:20:17 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:20:17 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:20:17 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:20:17 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:20:27 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:20:27 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:20:27 nas2 kernel: block drbd0: asender terminated
Oct 13 10:20:27 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:20:27 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:20:27 nas2 kernel: block drbd0: Connection closed
Oct 13 10:20:27 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:20:27 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:20:27 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:20:27 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:20:27 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:20:48 nas2 kernel: block drbd0: Handshake successful: Agreed  
network protocol version 90
Oct 13 10:20:48 nas2 kernel: block drbd0: conn( WFConnection ->  
WFReportParams )
Oct 13 10:20:48 nas2 kernel: block drbd0: Starting asender thread  
(from drbd0_receiver [4326])
Oct 13 10:20:48 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
Oct 13 10:20:48 nas2 kernel: block drbd0: drbd_sync_handshake:
Oct 13 10:20:48 nas2 kernel: block drbd0: self  
286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000  
bits:891369192 flags:0
Oct 13 10:20:48 nas2 kernel: block drbd0: peer  
136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B  
bits:891369192 flags:0
Oct 13 10:20:48 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
Oct 13 10:20:48 nas2 kernel: block drbd0: Becoming sync target due to  
disk states.
Oct 13 10:20:48 nas2 kernel: block drbd0: peer( Unknown -> Secondary )  
conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
Oct 13 10:20:59 nas2 kernel: block drbd0: PingAck did not arrive in  
time.
Oct 13 10:20:59 nas2 kernel: block drbd0: peer( Secondary -> Unknown )  
conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
Oct 13 10:20:59 nas2 kernel: block drbd0: asender terminated
Oct 13 10:20:59 nas2 kernel: block drbd0: Terminating asender thread
Oct 13 10:20:59 nas2 kernel: block drbd0: error receiving  
ReportBitMap, l: 4088!
Oct 13 10:20:59 nas2 kernel: block drbd0: Connection closed
Oct 13 10:20:59 nas2 kernel: block drbd0: conn( NetworkFailure ->  
Unconnected )
Oct 13 10:20:59 nas2 kernel: block drbd0: receiver terminated
Oct 13 10:20:59 nas2 kernel: block drbd0: Restarting receiver thread
Oct 13 10:20:59 nas2 kernel: block drbd0: receiver (re)started
Oct 13 10:20:59 nas2 kernel: block drbd0: conn( Unconnected ->  
WFConnection )
Oct 13 10:21:17 nas2 avahi-daemon[3971]: Invalid response packet.
Oct 13 10:21:17 nas2 last message repeated 4 times

At this point, ssh is extremely sluggish on nas2.. to the point it  
takes 30-60 seconds to type anything...

and you'll see it's no longer syncing

[root at nas1 ~]# cat /proc/drbd
version: 8.3.2 (api:88/proto:86-90)
GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by mockbuild at v20z-x86-64.home.local 
, 2009-08-29 14:07:55
  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Inconsistent C  
r----
     ns:32040144 nr:0 dw:0 dr:32048320 al:0 bm:1955 lo:0 pe:0 ua:0 ap: 
0 ep:1 wo:b oos:3565476768

My only thought is that the switch between the 2 machines is bad, but  
why would that lockup the machine...?

thanks
Marc
















More information about the drbd-user mailing list