[DRBD-user] PingAck did not arrive in time.

Dirk Bonenkamp - ProActive dirk at proactive.nl
Thu May 24 07:54:02 CEST 2018


Hello,

Thank you for your suggestion. The MTU is 1500 on both nodes. I had it
at 9000, but reverted everything to 'normal' to debug this problem.
Pinging as in your example works fine.
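
For reference, a minimal sketch of how the DRBD side of this can be
double-checked (plain shell, nothing exotic): drbdsetup prints the net
options configured for the resource on the running system.

# show the running configuration of resource r0, net options included
drbdsetup show r0

If I'm not mistaken, the deadline behind the error is the net option
ping-timeout, given in tenths of a second, with a keep-alive ping sent
every ping-int seconds.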

Cheers,

Dirk

On 23-05-18 21:22, Nelson Hicks wrote:
> Is there any chance this could be an MTU mismatch between the two
> nodes? If you use ping with varying packet sizes from one node to the
> other, do they stop working above a specific size? Does ifconfig
> report the same MTU size for the interface on both nodes?
>
> Examples:
>
> ifconfig | grep MTU
>
> ping -s 500 <other_ip>
>
> ping -s 1400 <other_ip>
>
> ping -s 1472 <other_ip>
>
> ping -s 2000 <other_ip>
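>
> One more variant, for what it's worth: adding -M do sets the don't-fragment
> bit, so an oversized packet is rejected with an error instead of being
> silently fragmented, which makes an MTU mismatch easier to spot:
>
> ping -M do -s 1472 <other_ip>
> ping -M do -s 2000 <other_ip>
>
> With a 1500-byte MTU on both ends, the first should succeed and the second
> should fail.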
>
> Thanks,
>
> - Nelson Hicks
>
>
>
>
> On 05/23/2018 02:07 PM, Dirk Bonenkamp - ProActive wrote:
>> Hi,
>>
>> Thank you for your reply.
>>
>> I am, or was, under the impression that DRBD9 is the new and improved
>> DRBD, so I figured I would use this version. Is that not the case?
>> Could somebody enlighten me a bit?
>>
>> I have already disabled all bonding and other fancy network features,
>> so I'm currently using a single NIC. Unfortunately, this doesn't solve
>> anything.
>>
>> Kind regards,
>>
>> Dirk
>>
>> On 23-05-18 14:20, Yannis Milios wrote:
>>> Two things:
>>>
>>> - I would use drbd8 instead of drbd9 for a 2-node setup.
>>> - I would first test with one NIC instead of two.
>>>
>>> On Wed, May 23, 2018 at 11:01 AM, Dirk Bonenkamp - ProActive
>>> <dirk at proactive.nl> wrote:
>>>
>>>     Hi List,
>>>
>>>     I'm struggling with a new DRBD9 installation: a simple Master/Slave
>>>     setup. I'm running Ubuntu 16.04 LTS with the DRBD9 packages from the
>>>     Launchpad PPA.
>>>
>>>     I've been running some DRBD8 systems in production for quite a few
>>>     years, so I have some experience. This setup is very similar; the only
>>>     major differences are that this is DRBD9 and that I use LUKS-encrypted
>>>     partitions as the backend.
>>>
>>>     I keep running into the 'PingAck did not arrive in time.' error, which,
>>>     if I understand correctly, points to network issues (see the complete
>>>     log snippet below). The error occurs when I try to reattach the
>>>     secondary node after a reboot. The initial sync works fine.
>>>
>>>     The servers are interconnected with two 10Gb NICs. I had bonding and
>>>     jumbo frames configured, but deactivated all of that, to no avail. I've
>>>     also stripped the DRBD configuration down to the bare minimum (see
>>>     below).
>>>
>>>     I've tested the connection with iperf and some other tools, and it
>>>     seems just fine.
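>>>
>>>     One more thing I could still capture (sketch only; the interface name
>>>     below is a placeholder for whatever carries the replication traffic):
>>>
>>>     # watch the DRBD replication port while the nodes try to reconnect,
>>>     # to see whether the keep-alive traffic crosses the link both ways
>>>     tcpdump -ni eth2 port 7789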
>>>
>>>     Could somebody point me in the right direction?
>>>
>>>     Thank you in advance, regards,
>>>
>>>     Dirk Bonenkamp
>>>
>>>     syslog messages:
>>>
>>>     May 23 11:31:56 data2 kernel: [  704.111755] drbd: loading out-of-tree module taints kernel.
>>>     May 23 11:31:56 data2 kernel: [  704.112290] drbd: module verification failed: signature and/or required key missing - tainting kernel
>>>     May 23 11:31:56 data2 kernel: [  704.127677] drbd: initialized. Version: 9.0.14-1 (api:2/proto:86-113)
>>>     May 23 11:31:56 data2 kernel: [  704.127680] drbd: GIT-hash: 62f906cf44ef02a30ce0c148fec223b40c51c533 build by root at data2, 2018-05-23 09:19:54
>>>     May 23 11:31:56 data2 kernel: [  704.127683] drbd: registered as block device major 147
>>>     May 23 11:31:56 data2 kernel: [  704.153565] drbd r0: Starting worker thread (from drbdsetup [4495])
>>>     May 23 11:31:56 data2 kernel: [  704.183031] drbd r0/0 drbd0: disk( Diskless -> Attaching )
>>>     May 23 11:31:56 data2 kernel: [  704.183066] drbd r0/0 drbd0: Maximum number of peer devices = 1
>>>     May 23 11:31:56 data2 kernel: [  704.183293] drbd r0: Method to ensure write ordering: flush
>>>     May 23 11:31:56 data2 kernel: [  704.183308] drbd r0/0 drbd0: drbd_bm_resize called with capacity == 273437203064
>>>     May 23 11:31:58 data2 kernel: [  706.508228] drbd r0/0 drbd0: resync bitmap: bits=34179650383 words=534057038 pages=1043081
>>>     May 23 11:31:58 data2 kernel: [  706.508234] drbd r0/0 drbd0: size = 127 TB (136718601532 KB)
>>>     May 23 11:31:58 data2 kernel: [  706.508236] drbd r0/0 drbd0: size = 127 TB (136718601532 KB)
>>>     May 23 11:32:10 data2 kernel: [  717.890420] drbd r0/0 drbd0: recounting of set bits took additional 1256ms
>>>     May 23 11:32:10 data2 kernel: [  717.890435] drbd r0/0 drbd0: disk( Attaching -> Outdated )
>>>     May 23 11:32:10 data2 kernel: [  717.890439] drbd r0/0 drbd0: attached to current UUID: 244DD61D2781DF44
>>>     May 23 11:32:10 data2 kernel: [  717.918473] drbd r0 data1: Starting sender thread (from drbdsetup [4544])
>>>     May 23 11:32:10 data2 kernel: [  717.922534] drbd r0 data1: conn( StandAlone -> Unconnected )
>>>     May 23 11:32:10 data2 kernel: [  717.922820] drbd r0 data1: Starting receiver thread (from drbd_w_r0 [4498])
>>>     May 23 11:32:10 data2 kernel: [  717.922973] drbd r0 data1: conn( Unconnected -> Connecting )
>>>     May 23 11:32:10 data2 kernel: [  718.421219] drbd r0 data1: Handshake to peer 1 successful: Agreed network protocol version 113
>>>     May 23 11:32:10 data2 kernel: [  718.421229] drbd r0 data1: Feature flags enabled on protocol level: 0xf TRIM THIN_RESYNC WRITE_SAME WRITE_ZEROES.
>>>     May 23 11:32:10 data2 kernel: [  718.421259] drbd r0 data1: Starting ack_recv thread (from drbd_r_r0 [4550])
>>>     May 23 11:32:10 data2 kernel: [  718.424095] drbd r0: Preparing cluster-wide state change 1205605755 (0->1 499/146)
>>>     May 23 11:32:10 data2 kernel: [  718.437172] drbd r0: State change 1205605755: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
>>>     May 23 11:32:10 data2 kernel: [  718.437185] drbd r0: Aborting cluster-wide state change 1205605755 (12ms) rv = -22
>>>     May 23 11:32:12 data2 kernel: [  719.896223] drbd r0: Preparing cluster-wide state change 445952355 (0->1 499/146)
>>>     May 23 11:32:12 data2 kernel: [  719.896498] drbd r0: State change 445952355: primary_nodes=2, weak_nodes=FFFFFFFFFFFFFFFC
>>>     May 23 11:32:12 data2 kernel: [  719.896508] drbd r0: Committing cluster-wide state change 445952355 (0ms)
>>>     May 23 11:32:12 data2 kernel: [  719.896541] drbd r0 data1: conn( Connecting -> Connected ) peer( Unknown -> Primary )
>>>     May 23 11:32:12 data2 kernel: [  719.912186] drbd r0/0 drbd0 data1: drbd_sync_handshake:
>>>     May 23 11:32:12 data2 kernel: [  719.912198] drbd r0/0 drbd0 data1: self 244DD61D2781DF44:0000000000000000:0000000000000000:0000000000000000 bits:52035 flags:20
>>>     May 23 11:32:12 data2 kernel: [  719.912207] drbd r0/0 drbd0 data1: peer E38BE51FE782EAE0:244DD61D2781DF44:934CAB8662DF0410:E555BDC58E528356 bits:53162 flags:20
>>>     May 23 11:32:12 data2 kernel: [  719.912214] drbd r0/0 drbd0 data1: uuid_compare()=-2 by rule 50
>>>     May 23 11:32:12 data2 kernel: [  719.912248] drbd r0/0 drbd0 data1: pdsk( DUnknown -> UpToDate ) repl( Off -> WFBitMapT )
>>>     May 23 11:32:32 data2 kernel: [  740.397026] drbd r0 data1: PingAck did not arrive in time.
>>>     May 23 11:32:32 data2 kernel: [  740.397121] drbd r0 data1: conn( Connected -> NetworkFailure ) peer( Primary -> Unknown )
>>>     May 23 11:32:32 data2 kernel: [  740.397131] drbd r0/0 drbd0 data1: pdsk( UpToDate -> DUnknown ) repl( WFBitMapT -> Off )
>>>     May 23 11:32:32 data2 kernel: [  740.397176] drbd r0 data1: ack_receiver terminated
>>>     May 23 11:32:32 data2 kernel: [  740.397182] drbd r0 data1: Terminating ack_recv thread
>>>     May 23 11:32:32 data2 kernel: [  740.458608] drbd r0 data1: Connection closed
>>>     May 23 11:32:32 data2 kernel: [  740.458650] drbd r0 data1: conn( NetworkFailure -> Unconnected )
>>>     May 23 11:32:32 data2 kernel: [  740.458688] drbd r0 data1: Restarting receiver thread
>>>     May 23 11:32:32 data2 kernel: [  740.458723] drbd r0 data1: conn( Unconnected -> Connecting )
>>>
>>>     resources:
>>>
>>>     resource r0 {
>>>             on data1 {
>>>                     device    /dev/drbd0;
>>>                     disk      /dev/mapper/mapper_secure;
>>>                     address 172.16.11.21:7789;
>>>                     meta-disk internal;
>>>             }
>>>             on data2 {
>>>                     device    /dev/drbd0;
>>>                     disk      /dev/mapper/mapper_secure;
>>>                     address 172.16.11.22:7789;
>>>                     meta-disk internal;
>>>             }
>>>     }
>>>
>>>     drbd configuration:
>>>
>>>     global {
>>>             usage-count yes;
>>>     }
>>>
>>>     common {
>>>             #handlers {
>>>             #        fence-peer "/usr/lib/drbd/crm-fence-peer.9.sh";
>>>             #        after-resync-target "/usr/lib/drbd/crm-unfence-peer.9.sh";
>>>             #}
>>>             #disk {
>>>             #        on-io-error detach;
>>>             #       disk-barrier no;
>>>             #       disk-flushes no;
>>>             #       al-extents 3833;
>>>             #        c-plan-ahead 7;
>>>             #        c-fill-target 2M;
>>>             #        c-min-rate 80M;
>>>             #        c-max-rate 720M;
>>>             #}
>>>             net {
>>>                     protocol C;
>>>                     #fencing resource-only;
>>>                     #cram-hmac-alg sha1;
>>>                     #verify-alg sha1;
>>>                     #shared-secret 1e69dc721fd2e65368ae3ba1e5929979;
>>>                     #after-sb-0pri disconnect;
>>>                     #after-sb-1pri disconnect;
>>>                     #after-sb-2pri disconnect;
>>>                     #max-buffers    8000;
>>>                     #max-epoch-size 8000;
>>>                     #sndbuf-size 0;
>>>                     #rcvbuf-size 2048k;
>>>             }
>>>     }
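>>>
>>>     For reference only (not applied here): the net section can also carry
>>>     the keep-alive settings explicitly. The values below are just the
>>>     documented defaults; raising ping-timeout would only be an experiment
>>>     to see whether the ack is merely late or never arrives at all.
>>>
>>>             net {
>>>                     protocol C;
>>>                     ping-int     10;  # seconds between keep-alive pings
>>>                     ping-timeout 5;   # PingAck deadline, in tenths of a second (0.5 s)
>>>                     timeout      60;  # deadline for ordinary replies, also tenths of a second
>>>             }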
>>>
>>>
>>>

-- 
ProActive Software <https://www.proactive-software.com>
Dirk Bonenkamp, CTO
Phone: +31 (0)23 54 222 99
Mobile: +31 (0)6 250 787 93
Richard Holkade 9, 2033 PZ Haarlem
LinkedIn <http://linkd.in/1V6egnk> | Facebook <http://bit.ly/FBProActive> | YouTube <http://bit.ly/1Mc23L9>
www.proactive.nl <https://www.proactive.nl>
