[DRBD-user] DRBD constantly re-syncing, getting to 100%, starting over. What?

Fri Oct 14 08:59:13 CEST 2016

> -----Original Message-----
> From: Viktor Villafuerte [mailto:viktor.villafuerte at optusnet.com.au]
> Sent: Wednesday, October 12, 2016 3:19 PM
> To: Eric Robinson <eric.robinson at psmnv.com>
> Cc: drbd-user at lists.linbit.com
> Subject: Re: [DRBD-user] DRBD constantly re-syncing, getting to 100%,
> starting over. What?
> 
> Hi Eric,
> 
> I've had the pleasure to deal with this exact issue, and in prod too :O
> 
> 
> On Wed 12 Oct 2016 14:04:48, Eric Robinson wrote:
> > This morning we are seeing an issue where drbd is repeatedly resyncing,
> getting to 100%, and starting over, and never getting to an
> UpToDate/UpToDate state.
> >
> > On one node, it is logging this sequence over and over…
> >
> > <snip>
> >
> > Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender
> > thread (from drbd_r_ha02_mys [804]) Oct 12 06:56:11 ha14a kernel: block
> drbd1: drbd_sync_handshake:
> > Oct 12 06:56:11 ha14a kernel: block drbd1: self
> >
> 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D
> 9
> > bits:0 flags:0 Oct 12 06:56:11 ha14a kernel: block drbd1: peer
> >
> 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5
> B bits:0 flags:0 Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1
> by rule 50 Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target
> due to disk states.
> > Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary )
> > conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
> Oct
> > 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats
> > [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> > Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats
> > [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0%
> > Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT ->
> > WFSyncUUID ) Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync
> > uuid
> >
> 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D
> 9
> > Oct 12 06:56:11 ha14a kernel: block drbd1: helper command:
> > /sbin/drbdadm before-resync-target minor-1 Oct 12 06:56:11 ha14a
> > kernel: block drbd1: helper command: /sbin/drbdadm
> > before-resync-target minor-1 exit code 0 (0x0) Oct 12 06:56:11 ha14a
> kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) Oct 12 06:56:11 ha14a
> kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
> 
> The two lines below are the important lines, where DRBD assumes network
> failure due PingAck not arrving in time.
> 
> > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in
> time.
> > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary ->
> > Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate ->
> > DUnknown )
> 
> You need to increase the timeout withing which PingAck is expected.
> 
> drbdadm net-options -v --ping-timeout=10 drbd0
> 
> this is the command I used. The --ping-timeout is in 10th of a second so value
> of '10' is actually 1s. Please confirm this in documentation as the version of
> DRBD I run this on was 8.x
> 
> Also you may need to tweak the timeout a bit..
> 
> 
> Hope this helps
> 

Thanks much for the input. It was an emergency, so I rebooted the node. After it came back up, all was well. The cluster has been fine for months and now it is apparently fine again. I can't imagine why it would suddenly behave this way. 

> v
> 
> 
> > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated Oct
> > 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating
> > drbd_a_ha02_mys Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql:
> > Connection closed Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql:
> > conn( NetworkFailure -> Unconnected ) Oct 12 06:56:12 ha14a kernel:
> > d-con ha02_mysql: receiver terminated Oct 12 06:56:12 ha14a kernel:
> > d-con ha02_mysql: Restarting receiver thread Oct 12 06:56:12 ha14a
> > kernel: d-con ha02_mysql: receiver (re)started Oct 12 06:56:12 ha14a
> > kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection ) Oct 12
> > 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed
> > network protocol version 101 Oct 12 06:56:12 ha14a kernel: d-con
> > ha02_mysql: Peer authenticated using 20 bytes HMAC Oct 12 06:56:12
> > ha14a kernel: d-con ha02_mysql: conn( WFConnection -> WFReportParams
> )
> >
> > </snip>
> >
> > On the other node, it is saying this over and over…
> >
> > <snip>
> >
> > Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake:
> > Oct 12 06:58:51 ha14b kernel: block drbd1: self
> >
> 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5
> B
> > bits:0 flags:0 Oct 12 06:58:51 ha14b kernel: block drbd1: peer
> >
> 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D
> 9 bits:0 flags:0 Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1
> by rule 70 Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source
> due to disk states.
> > Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary
> > ) conn( WFReportParams -> WFBitMapS ) Oct 12 06:58:51 ha14b kernel:
> > block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE
> > 23(1), total 23; compression: 100.0% Oct 12 06:58:51 ha14b kernel:
> > block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE
> > 23(1), total 23; compression: 100.0% Oct 12 06:58:51 ha14b kernel:
> > block drbd1: helper command: /sbin/drbdadm before-resync-source
> minor-1 Oct 12 06:58:51 ha14b kernel: block drbd1: helper command:
> /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) Oct 12
> 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource ) Oct 12
> 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will sync 0
> KB [0 bits set]).
> > Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID
> >
> 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5
> B
> > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by
> > peer Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary
> > -> Unknown ) conn( SyncSource -> BrokenPipe ) Oct 12 06:58:52 ha14b
> > kernel: d-con ha02_mysql: short read (expected size 16) Oct 12
> > 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed Oct 12
> > 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe ->
> > Unconnected ) Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver
> > terminated Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting
> > receiver thread Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql:
> > receiver (re)started Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql:
> > conn( Unconnected -> WFConnection )
> >
> > </snip>
> >
> > However, I can guarantee that the network connection is solid. Running
> ping flood, I get 30,000 packets sent with no loss or latency.
> >
> > Help, please?
> >
> > --
> > Eric Robinson
> >
> 
> > _______________________________________________
> > drbd-user mailing list
> > drbd-user at lists.linbit.com
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> 
> 
> --
> Regards
> 
> Viktor Villafuerte
> Optus Internet Engineering
> t: +61 2 80825265