How are you replicating? Do you have a dedicated link or are you using a cable connected to a switch? The log shows the network connection dropping frequently.

Marc Pope wrote:
> Hello,
>
> I am new to drbd, and I am just getting this set up. I am having a few
> problems. I am running 8.3 now on CentOS 5.3 64bit version. All latest
> patches applied.
>
> I am getting errors, see way below.. I am just at the initial sync
> step.
>
> nas1 is configured with:
> 6 x 1TB & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID 5
> on the 6 drives
>
> nas2 is configured with
> 4 x 1.5TB & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID
> 5 on the 4 drives
>
> /data partition is /dev/sdb1 which is a total of 4,200,000 MB (about
> 4TB)
> /meta partition is /dev/sdb2
>
> configuration: /etc/drbd.conf
>
> uname -a:
> Linux nas2.mydomainhere.com 2.6.18-164.el5 #1 SMP Thu Sep 3
> 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> config:
>
> common {
>   protocol C;
>   syncer {
>     rate 60M;
>     al-extents 257;
>   }
> }
> resource r0 {
>   handlers {
>     pri-on-incon-degr "halt -f";
>   }
>   disk {
>     on-io-error detach;
>   }
>   startup {
>     degr-wfc-timeout 120;
>   }
>   on nas1.mydomainhere.com {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address XXX.XXX.137.40:7789;
>     meta-disk /dev/sdb2[0];
>   }
>   on nas2.mydomainhere.com {
>     device /dev/drbd0;
>     disk /dev/sdb1;
>     address XXX.XXX.137.41:7789;
>     meta-disk /dev/sdb2[0];
>   }
> }
>
>
> When first starting up, it starts syncing for about 10-15 minutes...
> (you can see it's down to 3,513,200 left to sync)...
>
> # cat /proc/drbd
>
> version: 8.3.2 (api:88/proto:86-90)
> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by
> mockbuild at v20z-x86-64.home.local, 2009-08-29 14:07:55
> 0: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r----
> ns:30511104 nr:0 dw:0 dr:30511104 al:0 bm:1861 lo:0 pe:94 ua:0
> ap:0 ep:1 wo:b oos:3567008704
> [>....................] sync'ed: 0.9% (3483404/3513200)M
> finish: 13:08:27 speed: 75,376 (58,332) K/sec
>
>
> Then, in /var/log/messages, these errors start appearing:
>
> Oct 13 10:06:17 nas1 avahi-daemon[3793]: Invalid response packet.
> Oct 13 10:06:17 nas1 last message repeated 9 times
>
> As soon as that starts, then we get all kinds of errors like this
> (sorry for the long post, trying to be complete)...
>
> Oct 13 10:01:18 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:E0256B5135655E7D:22CB9163CBBE953B:027F952A21588331
> bits:899379200 flags:0
> Oct 13 10:01:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:01:18 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:01:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Oct 13 10:01:21 nas2 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0
> Oct 13 10:01:21 nas2 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFSyncUUID ->
> SyncTarget )
> Oct 13 10:01:21 nas2 kernel: block drbd0: Began resync as SyncTarget
> (will sync 3597516800 KB [899379200 bits set]).
>
> -- start of problems here....
> Oct 13 10:06:17 nas2 avahi-daemon[3971]: Invalid response packet.
> Oct 13 10:06:17 nas2 last message repeated 9 times
> Oct 13 10:10:41 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:10:41 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:10:41 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:10:41 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:10:41 nas2 kernel: block drbd0: short read receiving data:
> read 3720 expected 4096
> Oct 13 10:10:41 nas2 kernel: block drbd0: error receiving RSDataReply,
> l: 32792!
> Oct 13 10:10:41 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:10:41 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:10:41 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:10:41 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:10:41 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:10:41 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:11:37 nas2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Oct 13 10:11:37 nas2 kernel: r8169: eth0: link up
> Oct 13 10:11:39 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:11:39 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:11:39 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:11:39 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:11:39 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:11:39 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:11:39 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:11:39 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:11:39 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:11:39 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:11:49 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:11:49 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:11:49 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:11:49 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:11:49 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:11:49 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:11:49 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:11:49 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:11:49 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:11:49 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:11:49 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:12:18 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:12:18 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:12:18 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:12:18 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:12:18 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:12:18 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:12:18 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:12:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:12:18 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:12:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:12:29 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:12:29 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:12:29 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:12:29 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:12:29 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:12:29 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:12:29 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:12:29 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:12:29 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:12:29 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:12:29 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:12:54 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:12:54 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:12:54 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:12:54 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:12:54 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:12:54 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:12:54 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:12:54 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:12:54 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:12:54 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:13:05 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:13:05 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:13:05 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:13:05 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:13:05 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:13:05 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:13:05 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:13:05 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:13:05 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:13:05 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:13:05 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:19:07 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:19:07 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:19:07 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:19:07 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:19:37 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:19:37 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:19:37 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:19:37 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:19:37 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:19:37 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:19:37 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:19:37 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:19:37 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:19:37 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:19:47 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:19:47 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:19:47 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:19:47 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:19:47 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:19:47 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:19:47 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:19:47 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:19:47 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:19:47 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:19:47 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:20:17 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:20:17 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:20:17 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:20:17 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:20:17 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:20:17 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:20:17 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:20:17 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:20:17 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:20:17 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:20:27 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:20:27 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:20:27 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:20:27 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:20:27 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:20:27 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:20:27 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:20:27 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:20:27 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:20:27 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:20:27 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:20:48 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:20:48 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:20:48 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:20:48 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:20:48 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:20:48 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:20:48 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:20:48 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:20:48 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:20:48 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:20:59 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:20:59 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:20:59 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:20:59 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:20:59 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:20:59 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:20:59 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:20:59 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:20:59 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:20:59 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:20:59 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:21:17 nas2 avahi-daemon[3971]: Invalid response packet.
> Oct 13 10:21:17 nas2 last message repeated 4 times
>
> At this point, ssh is extremely sluggish on nas2.. to the point it
> takes 30-60 seconds to type anything...
>
> and you'll see it's no longer syncing
>
> [root at nas1 ~]# cat /proc/drbd
> version: 8.3.2 (api:88/proto:86-90)
> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by
> mockbuild at v20z-x86-64.home.local, 2009-08-29 14:07:55
> 0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Inconsistent C r----
> ns:32040144 nr:0 dw:0 dr:32048320 al:0 bm:1955 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:b oos:3565476768
>
> My only thought is that the switch between the 2 machines is bad, but
> why would that lockup the machine...?
>
> thanks
> Marc
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>

--
Telephone: 1-503-573-1262 Ex. 202
Sales: 1-877-4-LINBIT / 1-877-454-6248

LINBIT - Your Way to High Availability
8152 SW Hall Blvd., Suite #209 : Beaverton, OR 97008
http://www.linbit.com
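
P.S. To put a number on how often the link drops, the quoted log can be tallied with a short script. This is only a rough sketch in Python: the function name summarize_drbd_log is made up for illustration, the sample lines are copied from the log above, and the regular expression assumes the "conn( A -> B )" format that 8.3.2 prints there. It is not part of DRBD or its tools.

```python
import re
from collections import Counter

# Matches DRBD connection-state transitions as they appear in the quoted
# kernel log, e.g. "conn( SyncTarget -> NetworkFailure )".
CONN_RE = re.compile(r"conn\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def summarize_drbd_log(lines):
    """Hypothetical helper: count missed PingAcks and conn() transitions."""
    missed_pingacks = 0
    transitions = Counter()
    for line in lines:
        if "PingAck did not arrive in time" in line:
            missed_pingacks += 1
        match = CONN_RE.search(line)
        if match:
            transitions[(match.group(1), match.group(2))] += 1
    return missed_pingacks, transitions


# Sample lines taken from the log above (wrapped lines rejoined).
sample = [
    "Oct 13 10:10:41 nas2 kernel: block drbd0: PingAck did not arrive in time.",
    "Oct 13 10:10:41 nas2 kernel: block drbd0: peer( Secondary -> Unknown ) "
    "conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )",
    "Oct 13 10:10:41 nas2 kernel: block drbd0: conn( NetworkFailure -> Unconnected )",
    "Oct 13 10:11:37 nas2 kernel: NETDEV WATCHDOG: eth0: transmit timed out",
]

missed, transitions = summarize_drbd_log(sample)
print("missed PingAcks:", missed)
for (old, new), count in sorted(transitions.items()):
    print(f"{old} -> {new}: {count}")
```

To run it against a real log, pass an open file such as /var/log/messages instead of the sample list. Lines that the mail client wrapped need to be rejoined first, otherwise a transition split across two lines will not match.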