[DRBD-user] DRBD constantly re-syncing, getting to 100%, starting over. What?

Wed Oct 12 16:35:58 CEST 2016

Shot in the dark: are the drives (or their controller, if you're using RAID) using any form of caching? It is conceivable that when the resync finishes it tries to flush the data to the device, and if that takes way too long it could lead to a timeout of the DRBD kernel thread.
Is IO happening on those drives when they are resyncing?
Try running something like "sync ; sleep 1 ; sync" on the Inconsistent node when it's resyncing (I hope that won't kill your IO)
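
If you want numbers rather than a feeling, something like the following could time the two flushes and show how much dirty data is waiting. It is only a rough sketch: it assumes Python 3 on Linux, and the helper names (parse_meminfo_kb, timed_sync) are made up for illustration. Dirty and Writeback are the standard /proc/meminfo fields.

```python
import os
import time


def parse_meminfo_kb(text, key):
    """Return the kB value for `key` in /proc/meminfo-style text, or None."""
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key:
            return int(rest.split()[0])
    return None


def timed_sync():
    """Flush dirty pages with os.sync() and return the elapsed seconds."""
    start = time.monotonic()
    os.sync()
    return time.monotonic() - start


if __name__ == "__main__":
    # Snapshot of how much data is still waiting to reach the disks.
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    print("Dirty kB:", parse_meminfo_kb(meminfo, "Dirty"))
    print("Writeback kB:", parse_meminfo_kb(meminfo, "Writeback"))

    # Same idea as "sync ; sleep 1 ; sync", but with timings.
    print("first sync took %.2f s" % timed_sync())
    time.sleep(1)
    print("second sync took %.2f s" % timed_sync())
```

A sync that takes many seconds on the Inconsistent node would at least be consistent with the flush theory; a fast one would suggest looking elsewhere.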

But that's really just a guess.
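
To see whether the connection always drops right after the resync starts, the kernel log could be scanned for the gap between "Began resync as" and the next drop message. This is a sketch under assumptions, not a DRBD tool: the function name is hypothetical, the two drop markers are simply the messages that appear in the quoted logs, and the year is passed in because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Syslog-style timestamp such as "Oct 12 06:56:11" (no year in the log).
STAMP = re.compile(r"([A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) ")

# Messages from the quoted logs that mark the connection going away.
DROP_MARKERS = ("PingAck did not arrive in time", "sock was shut down by peer")


def resync_to_drop_seconds(lines, year=2016):
    """Return seconds between each 'Began resync as' and the next drop."""
    gaps = []
    began = None
    for line in lines:
        m = STAMP.search(line)
        if not m:
            continue
        # The year is an assumption supplied by the caller.
        when = datetime.strptime("%d %s" % (year, m.group(1)),
                                 "%Y %b %d %H:%M:%S")
        if "Began resync as" in line:
            began = when
        elif began is not None and any(k in line for k in DROP_MARKERS):
            gaps.append((when - began).total_seconds())
            began = None
    return gaps
```

Fed the excerpts quoted below, it reports a gap of one second on each node, which fits a drop that happens as soon as the (empty) resync begins rather than at some random time.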

Jan

> On 12 Oct 2016, at 16:04, Eric Robinson <eric.robinson at psmnv.com> wrote:
> 
> This morning we are seeing an issue where drbd is repeatedly resyncing, getting to 100%, and starting over, and never getting to an UpToDate/UpToDate state.
>  
> On one node, it is logging this sequence over and over…
>  
> <snip>
>  
> Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender thread (from drbd_r_ha02_mys [804])
> Oct 12 06:56:11 ha14a kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:56:11 ha14a kernel: block drbd1: self 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5B bits:0 flags:0
> Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1 by rule 50
> Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target due to disk states.
> Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID )
> Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync uuid 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1
> Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0)
> Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFSyncUUID -> SyncTarget )
> Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in time.
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating drbd_a_ha02_mys
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( NetworkFailure -> Unconnected )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection )
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed network protocol version 101
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Peer authenticated using 20 bytes HMAC
> Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( WFConnection -> WFReportParams )
>  
> </snip>
>  
> On the other node, it is saying this over and over…
>  
> <snip>
>  
> Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake:
> Oct 12 06:58:51 ha14b kernel: block drbd1: self 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5B bits:0 flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 flags:0
> Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1 by rule 70
> Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source due to disk states.
> Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS )
> Oct 12 06:58:51 ha14b kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
> Oct 12 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource )
> Oct 12 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will sync 0 KB [0 bits set]).
> Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5B
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by peer
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary -> Unknown ) conn( SyncSource -> BrokenPipe )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: short read (expected size 16)
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe -> Unconnected )
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver terminated
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting receiver thread
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver (re)started
> Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection )
>  
> </snip>
>  
> However, I can guarantee that the network connection is solid. Running ping flood, I get 30,000 packets sent with no loss or latency.
>  
> Help, please?
>  
> --
> Eric Robinson
>  