Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
> -----Original Message----- > From: Viktor Villafuerte [mailto:viktor.villafuerte at optusnet.com.au] > Sent: Wednesday, October 12, 2016 3:19 PM > To: Eric Robinson <eric.robinson at psmnv.com> > Cc: drbd-user at lists.linbit.com > Subject: Re: [DRBD-user] DRBD constantly re-syncing, getting to 100%, > starting over. What? > > Hi Eric, > > I've had the pleasure to deal with this exact issue, and in prod too :O > > > On Wed 12 Oct 2016 14:04:48, Eric Robinson wrote: > > This morning we are seeing an issue where drbd is repeatedly resyncing, > getting to 100%, and starting over, and never getting to an > UpToDate/UpToDate state. > > > > On one node, it is logging this sequence over and over… > > > > <snip> > > > > Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender > > thread (from drbd_r_ha02_mys [804]) Oct 12 06:56:11 ha14a kernel: block > drbd1: drbd_sync_handshake: > > Oct 12 06:56:11 ha14a kernel: block drbd1: self > > > 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D > 9 > > bits:0 flags:0 Oct 12 06:56:11 ha14a kernel: block drbd1: peer > > > 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5 > B bits:0 flags:0 Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1 > by rule 50 Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target > due to disk states. > > Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary ) > > conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) > Oct > > 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats > > [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > > Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats > > [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > > Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT -> > > WFSyncUUID ) Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync > > uuid > > > 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D > 9 > > Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: > > /sbin/drbdadm before-resync-target minor-1 Oct 12 06:56:11 ha14a > > kernel: block drbd1: helper command: /sbin/drbdadm > > before-resync-target minor-1 exit code 0 (0x0) Oct 12 06:56:11 ha14a > kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) Oct 12 06:56:11 ha14a > kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]). > > The two lines below are the important lines, where DRBD assumes network > failure due PingAck not arrving in time. > > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in > time. > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> > > Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> > > DUnknown ) > > You need to increase the timeout withing which PingAck is expected. > > drbdadm net-options -v --ping-timeout=10 drbd0 > > this is the command I used. The --ping-timeout is in 10th of a second so value > of '10' is actually 1s. Please confirm this in documentation as the version of > DRBD I run this on was 8.x > > Also you may need to tweak the timeout a bit.. > > > Hope this helps > Thanks much for the input. It was an emergency, so I rebooted the node. After it came back up, all was well. The cluster has been fine for months and now it is apparently fine again. I can't imagine why it would suddenly behave this way. > v > > > > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated Oct > > 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating > > drbd_a_ha02_mys Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: > > Connection closed Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: > > conn( NetworkFailure -> Unconnected ) Oct 12 06:56:12 ha14a kernel: > > d-con ha02_mysql: receiver terminated Oct 12 06:56:12 ha14a kernel: > > d-con ha02_mysql: Restarting receiver thread Oct 12 06:56:12 ha14a > > kernel: d-con ha02_mysql: receiver (re)started Oct 12 06:56:12 ha14a > > kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection ) Oct 12 > > 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed > > network protocol version 101 Oct 12 06:56:12 ha14a kernel: d-con > > ha02_mysql: Peer authenticated using 20 bytes HMAC Oct 12 06:56:12 > > ha14a kernel: d-con ha02_mysql: conn( WFConnection -> WFReportParams > ) > > > > </snip> > > > > On the other node, it is saying this over and over… > > > > <snip> > > > > Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake: > > Oct 12 06:58:51 ha14b kernel: block drbd1: self > > > 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5 > B > > bits:0 flags:0 Oct 12 06:58:51 ha14b kernel: block drbd1: peer > > > 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D > 9 bits:0 flags:0 Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1 > by rule 70 Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source > due to disk states. > > Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary > > ) conn( WFReportParams -> WFBitMapS ) Oct 12 06:58:51 ha14b kernel: > > block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE > > 23(1), total 23; compression: 100.0% Oct 12 06:58:51 ha14b kernel: > > block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE > > 23(1), total 23; compression: 100.0% Oct 12 06:58:51 ha14b kernel: > > block drbd1: helper command: /sbin/drbdadm before-resync-source > minor-1 Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: > /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) Oct 12 > 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource ) Oct 12 > 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will sync 0 > KB [0 bits set]). > > Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID > > > 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5 > B > > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by > > peer Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary > > -> Unknown ) conn( SyncSource -> BrokenPipe ) Oct 12 06:58:52 ha14b > > kernel: d-con ha02_mysql: short read (expected size 16) Oct 12 > > 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed Oct 12 > > 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe -> > > Unconnected ) Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver > > terminated Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting > > receiver thread Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: > > receiver (re)started Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: > > conn( Unconnected -> WFConnection ) > > > > </snip> > > > > However, I can guarantee that the network connection is solid. Running > ping flood, I get 30,000 packets sent with no loss or latency. > > > > Help, please? > > > > -- > > Eric Robinson > > > > > _______________________________________________ > > drbd-user mailing list > > drbd-user at lists.linbit.com > > http://lists.linbit.com/mailman/listinfo/drbd-user > > > -- > Regards > > Viktor Villafuerte > Optus Internet Engineering > t: +61 2 80825265