[DRBD-user] PingAck did not arrive in time.

Oleksiy Evin o.evin at onefc.com
Fri Jun 21 09:16:44 CEST 2019


I've tried updating the kernel and DRBD 9 to the latest available versions, but nothing helped with the PingAck issue. So I had to downgrade to DRBD 8.4, which started replicating fine, except for the following messages in the log:
[139267.930516] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.930529] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
[139267.933231] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.933241] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
[139267.934064] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.934075] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
[139267.934942] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.934950] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
[139267.936012] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.936019] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
[139267.936818] block drbd0: BAD! enr=34406392 rs_left=-4 rs_failed=0 count=4 cstate=SyncTarget
[139267.936825] block drbd0: start offset (-2092958208) too large in drbd_bm_e_weight
Any ideas how to resolve this problem?
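For the record, the only fallback I can think of myself is to throw the resync bookkeeping away and force a clean full resync, since a negative rs_left in drbd_bm_e_weight looks like the bitmap accounting went inconsistent. Untested on my side, and assuming sgpplhan02 is the node with the stale data:

# on the SyncTarget node (untested sketch; discards the bitmap state)
drbdadm disconnect r0
drbdadm invalidate r0    # mark local data Inconsistent, full sync from peer
drbdadm connect r0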
-----Original Message-----
From: Oleksiy Evin <o.evin at onefc.com>
To: drbd-user at lists.linbit.com
Subject: PingAck did not arrive in time.
Date: Sun, 16 Jun 2019 22:37:08 +0800
Hi All,
Can anyone help me with the repeated "PingAck did not arrive in time." error during the initial bitmap synchronization? It first happened after I updated our cluster with the latest CentOS and DRBD updates. I'm using a basic DRBD configuration on a 527TB LVM volume, replicated across 2 nodes over a 100Gbps cross-over connection. The same connection is used by Pacemaker without any problems. I don't see any network adapter errors in the logs, and no reconnects or packet drops when the DRBD error happens. I've also tried another adapter with a 10Gbps direct cable connection and got the same error.
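For completeness, the only keep-alive knobs I've found so far are ping-int and ping-timeout in the net section. Something like the following is what I'd try next (the values here are my own guesses, not tested; ping-timeout is in tenths of a second):

net {
    protocol     C;
    ping-int     10;   # seconds between keep-alive pings (default 10)
    ping-timeout 30;   # tenths of a second to wait for the PingAck (default 5 = 0.5s)
}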
# rpm -q centos-release
centos-release-7-6.1810.2.el7.centos.x86_64
# yum list installed | grep drbd
drbd90-utils.x86_64                           9.6.0-1.el7.elrepo       @elrepo
kmod-drbd90.x86_64                            9.0.16-1.el7_6.elrepo    @elrepo
# ifconfig
ens2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.1.1  netmask 255.255.255.255  broadcast 172.16.1.1
        ether b8:83:03:67:3f:d4  txqueuelen 1000  (Ethernet)
        RX packets 63547  bytes 11147564 (10.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 265307  bytes 33045583 (31.5 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eno8: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.20.1.1  netmask 255.255.0.0  broadcast 172.20.255.255
        ether 20:67:7c:1c:42:c6  txqueuelen 1000  (Ethernet)
        RX packets 484  bytes 49086 (47.9 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 504  bytes 56974 (55.6 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 116  memory 0xe3000000-e37fffff
# drbdadm dump all
# /etc/drbd.conf
global {
    usage-count no;
}
common {
    options {
        auto-promote     yes;
    }
    net {
        protocol           C;
    }
}
# resource r0 on sgpplhan01: not ignored, not stacked
# defined at /etc/drbd.d/r0.res:1
resource r0 {
    volume 0 {
        device           /dev/drbd0 minor 0;
        disk             /dev/storage/data;
        meta-disk        internal;
    }
    on sgpplhan01 {
        node-id 0;
        address          ipv4 172.16.1.1:7788;
    }
    on sgpplhan02 {
        node-id 1;
        address          ipv4 172.16.2.1:7788;
    }
    net {
        after-sb-0pri    discard-zero-changes;
        after-sb-1pri    consensus;
        after-sb-2pri    disconnect;
    }
}
# dmesg | grep drbd
[37259.335235] drbd r0/0 drbd0 sgpplhan02: drbd_sync_handshake:
[37259.335245] drbd r0/0 drbd0 sgpplhan02: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:141608532581 flags:24
[37259.335254] drbd r0/0 drbd0 sgpplhan02: peer B7DA5A657F09CD92:45B54292B9CBC0CF:0000000000000000:0000000000000000 bits:141608532581 flags:20
[37259.335260] drbd r0/0 drbd0 sgpplhan02: uuid_compare()=-3 by rule 20
[37259.335265] drbd r0/0 drbd0 sgpplhan02: Writing the whole bitmap, full sync required after drbd_sync_handshake.
[37265.754528] drbd r0/0 drbd0 sgpplhan02: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
[37265.754546] drbd r0/0 drbd0: Resumed AL updates
[37279.780140] drbd r0 sgpplhan02: PingAck did not arrive in time.
[37279.781303] drbd r0 sgpplhan02: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
[37279.781313] drbd r0/0 drbd0 sgpplhan02: pdsk( UpToDate -> DUnknown ) repl( WFBitMapT -> Off )
[37279.781371] drbd r0 sgpplhan02: ack_receiver terminated
[37279.781376] drbd r0 sgpplhan02: Terminating ack_recv thread
[37279.833051] drbd r0 sgpplhan02: Connection closed
[37279.833069] drbd r0 sgpplhan02: conn( NetworkFailure -> Unconnected )
[37279.833086] drbd r0 sgpplhan02: Restarting receiver thread
[37279.833098] drbd r0 sgpplhan02: conn( Unconnected -> Connecting )
[37308.171618] drbd r0 sgpplhan02: Handshake to peer 1 successful: Agreed network protocol version 114
[37308.171628] drbd r0 sgpplhan02: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
[37308.171666] drbd r0 sgpplhan02: Starting ack_recv thread (from drbd_r_r0 [28699])
[37308.217846] drbd r0: Preparing cluster-wide state change 686534516 (0->1 499/146)
[37308.218242] drbd r0: State change 686534516: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
[37308.218253] drbd r0: Committing cluster-wide state change 686534516 (0ms)
[37308.218296] drbd r0 sgpplhan02: conn( Connecting -> Connected ) peer( Unknown -> Primary )
[37308.222753] drbd r0/0 drbd0 sgpplhan02: drbd_sync_handshake:
[37308.222763] drbd r0/0 drbd0 sgpplhan02: self 0000000000000004:0000000000000000:0000000000000000:0000000000000000 bits:141608532581 flags:124
[37308.222771] drbd r0/0 drbd0 sgpplhan02: peer B7DA5A657F09CD92:45B54292B9CBC0CF:0000000000000000:0000000000000000 bits:141608532581 flags:120
[37308.222777] drbd r0/0 drbd0 sgpplhan02: uuid_compare()=-3 by rule 20
[37308.222782] drbd r0/0 drbd0 sgpplhan02: Writing the whole bitmap, full sync required after drbd_sync_handshake.
[37314.890717] drbd r0/0 drbd0 sgpplhan02: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
[37328.669598] drbd r0 sgpplhan02: PingAck did not arrive in time.
[37328.670759] drbd r0 sgpplhan02: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
[37328.670770] drbd r0/0 drbd0 sgpplhan02: pdsk( UpToDate -> DUnknown ) repl( WFBitMapT -> Off )
[37328.670823] drbd r0 sgpplhan02: ack_receiver terminated
[37328.670828] drbd r0 sgpplhan02: Terminating ack_recv thread
[37328.718096] drbd r0 sgpplhan02: Connection closed
[37328.718112] drbd r0 sgpplhan02: conn( NetworkFailure -> Unconnected )
[37328.718127] drbd r0 sgpplhan02: Restarting receiver thread
[37328.718138] drbd r0 sgpplhan02: conn( Unconnected -> Connecting )
[37351.755553] drbd r0 sgpplhan02: conn( Connecting -> Disconnecting )
[37351.794081] drbd r0 sgpplhan02: Connection closed




