[DRBD-user] Lots of errors after 10-15 minutes

Brian R. Hellman brian at linbit.com
Tue Oct 13 22:18:16 CEST 2009

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


How are you replicating?  Do you have a dedicated link or are you using
a cable connected to a switch?  The log shows the network connection
dropping frequently.



Marc Pope wrote:
> Hello,
>
> I am new to drbd, and I am just getting this set up. I am having a few
> problems. I am running 8.3 now on CentOS 5.3 64bit version. All latest
> patches applied.
>
> I am getting errors, see way below..   I am just at the initial sync
> step.
>
> nas1 is configured with:
>     6 x 1TB  & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID 5
> on the 6 drives
>
> nas2 is configured with
>     4 x  1.5TB & 2 x 250 drives, 8gb ram, adaptec 5805 raid card RAID
> 5 on the 4 drives
>
> /data partition is /dev/sdb1  which is a total of 4,200,000 MB (about
> 4TB)
> /meta partition is /dev/sdb2
>
> configuration: /etc/drbd.conf
>
> uname -a:
>     Linux nas2.mydomainhere.com 2.6.18-164.el5 #1 SMP Thu Sep 3
> 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
>
> config:
>
> common {
>     protocol C;
>     syncer {
>         rate          60M;
>         al-extents    257;
>         }
>     }
> resource r0 {
> handlers {
>     pri-on-incon-degr        "halt -f";
>     }
> disk {
>     on-io-error        detach;
>     }
> startup {
>     degr-wfc-timeout    120;
>     }
> on nas1.mydomainhere.com {
>     device        /dev/drbd0;
>     disk            /dev/sdb1;
>     address        XXX.XXX.137.40:7789;
>     meta-disk    /dev/sdb2[0];
> }
> on nas2.mydomainhere.com {
>     device        /dev/drbd0;
>     disk                /dev/sdb1;
>     address        XXX.XXX.137.41:7789;
>     meta-disk    /dev/sdb2[0];
> }
> }
>
>
> When first starting up, it starts syncing for about 10-15 minutes...
> (you can see it's down to 3,513,200 left to sync)...
>
> # cat /proc/drbd
>
> version: 8.3.2 (api:88/proto:86-90)
> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by
> mockbuild at v20z-x86-64.home.local, 2009-08-29 14:07:55
>  0: cs:SyncSource ro:Secondary/Secondary ds:UpToDate/Inconsistent C r----
>     ns:30511104 nr:0 dw:0 dr:30511104 al:0 bm:1861 lo:0 pe:94 ua:0
> ap:0 ep:1 wo:b oos:3567008704
>     [>....................] sync'ed:  0.9% (3483404/3513200)M
>     finish: 13:08:27 speed: 75,376 (58,332) K/sec
>
>
> Then, in /var/log/messages, these errors start appearing:
>
> Oct 13 10:06:17 nas1 avahi-daemon[3793]: Invalid response packet.
> Oct 13 10:06:17 nas1 last message repeated 9 times
>
> As soon as that starts, then we get all kinds of errors like this
> (sorry for the long post, trying to be complete)...
>
> Oct 13 10:01:18 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:E0256B5135655E7D:22CB9163CBBE953B:027F952A21588331
> bits:899379200 flags:0
> Oct 13 10:01:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:01:18 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:01:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFBitMapT -> WFSyncUUID )
> Oct 13 10:01:21 nas2 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0
> Oct 13 10:01:21 nas2 kernel: block drbd0: helper command:
> /sbin/drbdadm before-resync-target minor-0 exit code 0 (0x0)
> Oct 13 10:01:21 nas2 kernel: block drbd0: conn( WFSyncUUID ->
> SyncTarget )
> Oct 13 10:01:21 nas2 kernel: block drbd0: Began resync as SyncTarget
> (will sync 3597516800 KB [899379200 bits set]).
>
> -- start of problems here....
> Oct 13 10:06:17 nas2 avahi-daemon[3971]: Invalid response packet.
> Oct 13 10:06:17 nas2 last message repeated 9 times
> Oct 13 10:10:41 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:10:41 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:10:41 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:10:41 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:10:41 nas2 kernel: block drbd0: short read receiving data:
> read 3720 expected 4096
> Oct 13 10:10:41 nas2 kernel: block drbd0: error receiving RSDataReply,
> l: 32792!
> Oct 13 10:10:41 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:10:41 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:10:41 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:10:41 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:10:41 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:10:41 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:11:37 nas2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
> Oct 13 10:11:37 nas2 kernel: r8169: eth0: link up
> Oct 13 10:11:39 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:11:39 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:11:39 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:11:39 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:11:39 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:11:39 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:11:39 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:11:39 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:11:39 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:11:39 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:11:49 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:11:49 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:11:49 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:11:49 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:11:49 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:11:49 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:11:49 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:11:49 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:11:49 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:11:49 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:11:49 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:12:18 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:12:18 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:12:18 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:12:18 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:12:18 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:12:18 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:12:18 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:12:18 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:12:18 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:12:19 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:12:29 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:12:29 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:12:29 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:12:29 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:12:29 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:12:29 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:12:29 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:12:29 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:12:29 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:12:29 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:12:29 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:12:54 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:12:54 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:12:54 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:12:54 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:12:54 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:12:54 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:12:54 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:12:54 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:12:54 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:12:54 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:13:05 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:13:05 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:13:05 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:13:05 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:13:05 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:13:05 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:13:05 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:13:05 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:13:05 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:13:05 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:13:05 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:19:07 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:19:07 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:19:07 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:19:07 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:19:37 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:19:37 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:19:37 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:19:37 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:19:37 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:19:37 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:19:37 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:19:37 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:19:37 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:19:37 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:19:47 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:19:47 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:19:47 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:19:47 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:19:47 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:19:47 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:19:47 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:19:47 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:19:47 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:19:47 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:19:47 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:20:17 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:20:17 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:20:17 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:20:17 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:20:17 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:20:17 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:20:17 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:20:17 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:20:17 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:20:17 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:20:27 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:20:27 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:20:27 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:20:27 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:20:27 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:20:27 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:20:27 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:20:27 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:20:27 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:20:27 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:20:27 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:20:48 nas2 kernel: block drbd0: Handshake successful: Agreed
> network protocol version 90
> Oct 13 10:20:48 nas2 kernel: block drbd0: conn( WFConnection ->
> WFReportParams )
> Oct 13 10:20:48 nas2 kernel: block drbd0: Starting asender thread
> (from drbd0_receiver [4326])
> Oct 13 10:20:48 nas2 kernel: block drbd0: data-integrity-alg: <not-used>
> Oct 13 10:20:48 nas2 kernel: block drbd0: drbd_sync_handshake:
> Oct 13 10:20:48 nas2 kernel: block drbd0: self
> 286F730971D4FFB0:0000000000000000:0000000000000000:0000000000000000
> bits:891369192 flags:0
> Oct 13 10:20:48 nas2 kernel: block drbd0: peer
> 136EDF2D710BB952:286F730971D4FFB1:E0256B5135655E7D:22CB9163CBBE953B
> bits:891369192 flags:0
> Oct 13 10:20:48 nas2 kernel: block drbd0: uuid_compare()=-1 by rule 5
> Oct 13 10:20:48 nas2 kernel: block drbd0: Becoming sync target due to
> disk states.
> Oct 13 10:20:48 nas2 kernel: block drbd0: peer( Unknown -> Secondary )
> conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 13 10:20:59 nas2 kernel: block drbd0: PingAck did not arrive in time.
> Oct 13 10:20:59 nas2 kernel: block drbd0: peer( Secondary -> Unknown )
> conn( WFBitMapT -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 13 10:20:59 nas2 kernel: block drbd0: asender terminated
> Oct 13 10:20:59 nas2 kernel: block drbd0: Terminating asender thread
> Oct 13 10:20:59 nas2 kernel: block drbd0: error receiving
> ReportBitMap, l: 4088!
> Oct 13 10:20:59 nas2 kernel: block drbd0: Connection closed
> Oct 13 10:20:59 nas2 kernel: block drbd0: conn( NetworkFailure ->
> Unconnected )
> Oct 13 10:20:59 nas2 kernel: block drbd0: receiver terminated
> Oct 13 10:20:59 nas2 kernel: block drbd0: Restarting receiver thread
> Oct 13 10:20:59 nas2 kernel: block drbd0: receiver (re)started
> Oct 13 10:20:59 nas2 kernel: block drbd0: conn( Unconnected ->
> WFConnection )
> Oct 13 10:21:17 nas2 avahi-daemon[3971]: Invalid response packet.
> Oct 13 10:21:17 nas2 last message repeated 4 times
>
> At this point, ssh is extremely sluggish on nas2.. to the point it
> takes 30-60 seconds to type anything...
>
> and you'll see it's no longer syncing
>
> [root at nas1 ~]# cat /proc/drbd
> version: 8.3.2 (api:88/proto:86-90)
> GIT-hash: dd7985327f146f33b86d4bff5ca8c94234ce840e build by
> mockbuild at v20z-x86-64.home.local, 2009-08-29 14:07:55
>  0: cs:WFConnection ro:Secondary/Unknown ds:UpToDate/Inconsistent C r----
>     ns:32040144 nr:0 dw:0 dr:32048320 al:0 bm:1955 lo:0 pe:0 ua:0 ap:0
> ep:1 wo:b oos:3565476768
>
> My only thought is that the switch between the 2 machines is bad, but
> why would that lockup the machine...?
>
> thanks
> Marc
>
>
>
>
>
>
>
>
>
>
>
>
>
> _______________________________________________
> drbd-user mailing list
> drbd-user at lists.linbit.com
> http://lists.linbit.com/mailman/listinfo/drbd-user
>

-- 

Telephone: 1-503-573-1262 Ex. 202
Sales: 1-877-4-LINBIT / 1-877-454-6248 
LINBIT - Your Way to High Availability
8152 SW Hall Blvd., Suite #209 : Beaverton, OR 97008 

http://www.linbit.com




More information about the drbd-user mailing list