Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi Eric, I've had the pleasure to deal with this exact issue, and in prod too :O On Wed 12 Oct 2016 14:04:48, Eric Robinson wrote: > This morning we are seeing an issue where drbd is repeatedly resyncing, getting to 100%, and starting over, and never getting to an UpToDate/UpToDate state. > > On one node, it is logging this sequence over and over… > > <snip> > > Oct 12 06:56:11 ha14a kernel: d-con ha02_mysql: Starting asender thread (from drbd_r_ha02_mys [804]) > Oct 12 06:56:11 ha14a kernel: block drbd1: drbd_sync_handshake: > Oct 12 06:56:11 ha14a kernel: block drbd1: self 13FB9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 flags:0 > Oct 12 06:56:11 ha14a kernel: block drbd1: peer 38E17129E5821B5F:13FB9B08BF812C5B:13FA9B08BF812C5B:13F99B08BF812C5B bits:0 flags:0 > Oct 12 06:56:11 ha14a kernel: block drbd1: uuid_compare()=-1 by rule 50 > Oct 12 06:56:11 ha14a kernel: block drbd1: Becoming sync target due to disk states. > Oct 12 06:56:11 ha14a kernel: block drbd1: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) > Oct 12 06:56:11 ha14a kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > Oct 12 06:56:11 ha14a kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFBitMapT -> WFSyncUUID ) > Oct 12 06:56:11 ha14a kernel: block drbd1: updated sync uuid 13FC9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 > Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 > Oct 12 06:56:11 ha14a kernel: block drbd1: helper command: /sbin/drbdadm before-resync-target minor-1 exit code 0 (0x0) > Oct 12 06:56:11 ha14a kernel: block drbd1: conn( WFSyncUUID -> SyncTarget ) > Oct 12 06:56:11 ha14a kernel: block drbd1: Began resync as SyncTarget (will sync 0 KB [0 bits set]). The two lines below are the important lines, where DRBD assumes network failure due PingAck not arrving in time. > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: PingAck did not arrive in time. > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: peer( Primary -> Unknown ) conn( SyncTarget -> NetworkFailure ) pdsk( UpToDate -> DUnknown ) You need to increase the timeout withing which PingAck is expected. drbdadm net-options -v --ping-timeout=10 drbd0 this is the command I used. The --ping-timeout is in 10th of a second so value of '10' is actually 1s. Please confirm this in documentation as the version of DRBD I run this on was 8.x Also you may need to tweak the timeout a bit.. Hope this helps v > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: asender terminated > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Terminating drbd_a_ha02_mys > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Connection closed > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( NetworkFailure -> Unconnected ) > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver terminated > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Restarting receiver thread > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: receiver (re)started > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection ) > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Handshake successful: Agreed network protocol version 101 > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: Peer authenticated using 20 bytes HMAC > Oct 12 06:56:12 ha14a kernel: d-con ha02_mysql: conn( WFConnection -> WFReportParams ) > > </snip> > > On the other node, it is saying this over and over… > > <snip> > > Oct 12 06:58:51 ha14b kernel: block drbd1: drbd_sync_handshake: > Oct 12 06:58:51 ha14b kernel: block drbd1: self 38E17129E5821B5F:148D9B08BF812C5B:148C9B08BF812C5B:148B9B08BF812C5B bits:0 flags:0 > Oct 12 06:58:51 ha14b kernel: block drbd1: peer 148D9B08BF812C5A:0000000000000000:4B9700420A3698D8:4B9600420A3698D9 bits:0 flags:0 > Oct 12 06:58:51 ha14b kernel: block drbd1: uuid_compare()=1 by rule 70 > Oct 12 06:58:51 ha14b kernel: block drbd1: Becoming sync source due to disk states. > Oct 12 06:58:51 ha14b kernel: block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) > Oct 12 06:58:51 ha14b kernel: block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > Oct 12 06:58:51 ha14b kernel: block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 23(1), total 23; compression: 100.0% > Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 > Oct 12 06:58:51 ha14b kernel: block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) > Oct 12 06:58:51 ha14b kernel: block drbd1: conn( WFBitMapS -> SyncSource ) > Oct 12 06:58:51 ha14b kernel: block drbd1: Began resync as SyncSource (will sync 0 KB [0 bits set]). > Oct 12 06:58:51 ha14b kernel: block drbd1: updated sync UUID 38E17129E5821B5F:148E9B08BF812C5B:148D9B08BF812C5B:148C9B08BF812C5B > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: sock was shut down by peer > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: peer( Secondary -> Unknown ) conn( SyncSource -> BrokenPipe ) > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: short read (expected size 16) > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Connection closed > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( BrokenPipe -> Unconnected ) > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver terminated > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: Restarting receiver thread > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: receiver (re)started > Oct 12 06:58:52 ha14b kernel: d-con ha02_mysql: conn( Unconnected -> WFConnection ) > > </snip> > > However, I can guarantee that the network connection is solid. Running ping flood, I get 30,000 packets sent with no loss or latency. > > Help, please? > > -- > Eric Robinson > > _______________________________________________ > drbd-user mailing list > drbd-user at lists.linbit.com > http://lists.linbit.com/mailman/listinfo/drbd-user -- Regards Viktor Villafuerte Optus Internet Engineering t: +61 2 80825265