[DRBD-user] "PingAck not received" messages

Tue May 22 17:05:17 CEST 2012

On 22/05/12 13:05, Florian Haas wrote:
>> Indeed, I started logging this command every second, and e.g. this kind
>> of event is typical every few hours:
>>
>>  dd if=/dev/zero of=/dev/drbd13 conv=fdatasync bs=1M count=1 2>&1 | \
>>    grep copied
>>
>> 2012-05-22 02:17:00 W 1048576 bytes (1.0 MB) copied, 0.0115253 s, 91.0 MB/s
>> 2012-05-22 02:17:01 W 1048576 bytes (1.0 MB) copied, 0.011519 s, 91.0 MB/s
>> 2012-05-22 02:17:02 W 1048576 bytes (1.0 MB) copied, 0.0116563 s, 90.0 MB/s
>> 2012-05-22 02:17:03 W 1048576 bytes (1.0 MB) copied, 1.1898 s, 881 kB/s
>> 2012-05-22 02:17:05 W 1048576 bytes (1.0 MB) copied, 28.3202 s, 37.0 kB/s
>> 2012-05-22 02:17:35 W 1048576 bytes (1.0 MB) copied, 0.0127468 s, 82.3 MB/s
>> 2012-05-22 02:17:36 W 1048576 bytes (1.0 MB) copied, 0.0113499 s, 92.4 MB/s
>> 2012-05-22 02:17:37 W 1048576 bytes (1.0 MB) copied, 0.0112707 s, 93.0 MB/s
>>
>> And in the kernel log:
>>
>> May 22 02:17:11 v3a kernel: [1341064.126449] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967295
>> May 22 02:17:17 v3a kernel: [1341070.129829] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967294
>> May 22 02:17:23 v3a kernel: [1341076.133170] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967293
>> May 22 02:17:29 v3a kernel: [1341082.133592] block drbd13: [drbd13_worker/797] sock_sendmsg time expired, ko = 4294967292
>>
>> Curiously, the "v3a" host (on which I'm running this test) only shows
>> these disconnects; it's the "v3b" host that gives the "PingAck not
>> received" messages, but not in this instance.
> 
> And you're absolutely certain that "ip -s link show dev <nic>" shows
> no drops or errors (for <nic> replaced with your DRBD replication
> device name)? Also, if you're replicating through a switch as opposed
> to via a back-to-back connection, did you check the switch port
> statistics for errors as well?

I hadn't spotted any, but I've asked for a third opinion because I can't
see any other explanation.  I've also set up a pair of scripts that hold
a continuous TCP connection open over the same interfaces and will report
any break in TCP connectivity of the kind DRBD appears to be seeing.
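The monitoring scripts themselves aren't included here, so the following is only a minimal Ruby sketch of the idea under a simple heartbeat design, not the scripts actually in use: one side writes a line every `interval` seconds, and the other reports any gap longer than `threshold` seconds, plus any close or reset of the connection. `run_sender`, `run_receiver` and the interval/threshold values are all hypothetical.

```ruby
require "socket"

# Hypothetical defaults for the heartbeat monitor.
INTERVAL  = 1.0 # seconds between heartbeats
THRESHOLD = 3.0 # report gaps longer than this many seconds

def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Sender: connect to the peer and write one line per interval.
# With count: nil it runs until the connection fails.
def run_sender(host, port, interval: INTERVAL, count: nil)
  sock = TCPSocket.new(host, port)
  sent = 0
  loop do
    sock.puts("beat #{sent}")
    sent += 1
    break if count && sent >= count
    sleep interval
  end
rescue SystemCallError => e
  warn "#{Time.now} sender lost connection: #{e.message}"
ensure
  sock&.close
end

# Receiver: accept one connection, time the arrival of each line and
# report gaps above the threshold. Returns the list of late gaps (s).
def run_receiver(server, threshold: THRESHOLD, out: $stdout)
  client = server.accept
  last = monotonic_now
  gaps = []
  begin
    while client.gets
      now = monotonic_now
      gap = now - last
      if gap > threshold
        out.puts "#{Time.now} heartbeat gap of #{format('%.2f', gap)} s"
        gaps << gap
      end
      last = now
    end
    out.puts "#{Time.now} connection closed by peer"
  rescue SystemCallError => e # e.g. Errno::ECONNRESET
    out.puts "#{Time.now} connection broken: #{e.message}"
  ensure
    client.close
  end
  gaps
end

# Example (hypothetical address and port on the replication network):
#   receiver host: server = TCPServer.new("0.0.0.0", 7790)
#                  loop { run_receiver(server) }
#   sender host:   loop { run_sender("192.0.2.11", 7790); sleep 1 }
```

Because `gets` blocks, a stall is reported only when the next heartbeat finally arrives (or the connection drops), which is enough to line its timestamp up against the dd log and the kernel messages.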

> I will observe that up to this point you haven't shared your DRBD
> config, so if you pastebinned that and shared the URL here, people
> could look into it and point to possible issues.

I'm not using drbdadm and the helper; my "pairvm" script manages DRBD
for VMs and uses this command to attach the disc:

      drbd.setup("disk", drbd_backing_device, drbd_meta_device, 0)

and these commands to configure the network connection (pretty
unambitious stuff, I assume):

      drbd.setup("net", *(drbd_net_args + extra_options))
      drbd.setup("syncer", "-r", "20M")

      def drbd_net_args
        [
          "#{Global.ips['here_drbd']}:#{drbd_port}",
          "#{Global.ips['there_drbd']}:#{drbd_port}",
          "B",
          "--after-sb-0pri", "discard-zero-changes",
          "--after-sb-1pri", "consensus",
          "--after-sb-2pri", "disconnect",
          "--ping-timeout", "50"
        ]
      end

NB these just translate to "drbdsetup /dev/drbdX net ..." with the IPs
and ports automatically assigned by the script.
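To make that translation concrete, here is a minimal Ruby sketch of the expansion. `drbd.setup` and `Global.ips` are pairvm internals that aren't shown, so the addresses, the port and the `drbdsetup_net_command` helper below are hypothetical stand-ins, and the sketch assumes `drbd.setup` simply prepends `drbdsetup /dev/drbdX` and the subcommand name. The minor number 13 is taken from the kernel log quoted above.

```ruby
require "shellwords"

# Hypothetical stand-ins for values pairvm assigns automatically.
HERE_IP  = "192.0.2.10"
THERE_IP = "192.0.2.11"
PORT     = 7789
MINOR    = 13

# Same option list as pairvm's drbd_net_args, parameterised.
def drbd_net_args(here_ip, there_ip, port)
  [
    "#{here_ip}:#{port}",
    "#{there_ip}:#{port}",
    "B",
    "--after-sb-0pri", "discard-zero-changes",
    "--after-sb-1pri", "consensus",
    "--after-sb-2pri", "disconnect",
    "--ping-timeout", "50" # assumed unit: tenths of a second
  ]
end

# Hypothetical helper: the argv drbd.setup("net", ...) is assumed to run.
def drbdsetup_net_command(minor, args, extra_options = [])
  ["drbdsetup", "/dev/drbd#{minor}", "net", *args, *extra_options]
end

cmd = drbdsetup_net_command(MINOR, drbd_net_args(HERE_IP, THERE_IP, PORT))
puts Shellwords.join(cmd)
# prints: drbdsetup /dev/drbd13 net 192.0.2.10:7789 192.0.2.11:7789 B --after-sb-0pri discard-zero-changes --after-sb-1pri consensus --after-sb-2pri disconnect --ping-timeout 50
```

If I'm reading the DRBD 8.3 documentation correctly, `--ping-timeout` is given in tenths of a second, so "50" gives the peer 5 seconds to answer a keep-alive ping, ten times the 500 ms default.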

-- 
Matthew
