Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm trying to understand a symptom for a client who uses DRBD to run sets of virtual machines between three pairs of servers (v1a/v1b, v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O is buffered depending on which replication mode is chosen, and on the buffer settings.

Firstly, it surprised me that even in replication mode "A", the system still seemed limited by the bandwidth between nodes. I found this out when the customer's bonded interface had flipped over to its 100Mb backup connection and they suddenly had I/O problems. While I was investigating this and running tests, I noticed that switching to mode A didn't help, even when measuring short transfers that I'd expect to fit into reasonably sized buffers.

What kind of buffer size can I expect from an "auto-tuned" DRBD? It seems important to be able to cover bursts without leaning on the network, so I'd like to know whether that's possible with some special tuning.
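For concreteness, here is a minimal sketch of the knobs I mean, assuming a DRBD 8.3-style configuration (the resource name, devices and addresses below are invented for illustration, not taken from the real config):

  resource r0 {
    protocol A;             # asynchronous: a write completes once it is handed
                            # to the local TCP send buffer
    net {
      sndbuf-size 0;        # 0 is documented to mean "let the kernel auto-tune"
      # sndbuf-size 512k;   # an explicit value pins the send buffer - the kind
                            # of burst headroom I'm asking about
    }
    on v3a {
      device    /dev/drbd0;
      disk      /dev/vg0/lv_test;   # invented
      address   10.0.0.1:7788;      # invented
      meta-disk internal;
    }
    on v3b {
      device    /dev/drbd0;
      disk      /dev/vg0/lv_test;   # invented
      address   10.0.0.2:7788;      # invented
      meta-disk internal;
    }
  }

In other words: can the auto-tuned send buffer ever be large enough to absorb a short burst, or does covering bursts in protocol A always come down to setting sndbuf-size by hand?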
The other problem is the "PingAck not received" messages that have been littering the logs of the v3a/v3b servers for the last couple of weeks, e.g. this has been happening every few hours for one DRBD or another:

  May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did not arrive in time.
  May 14 08:21:45 v3b kernel: [661127.875553] block drbd10: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
  May 14 08:21:45 v3b kernel: [661127.875562] block drbd10: asender terminated
  May 14 08:21:45 v3b kernel: [661127.875564] block drbd10: Terminating drbd10_asender
  May 14 08:21:45 v3b kernel: [661127.875597] block drbd10: short read expecting header on sock: r=-512
  May 14 08:21:45 v3b kernel: [661127.882896] block drbd10: Connection closed
  May 14 08:21:45 v3b kernel: [661127.882899] block drbd10: conn( NetworkFailure -> Unconnected )
  May 14 08:21:45 v3b kernel: [661127.882904] block drbd10: receiver terminated
  May 14 08:21:45 v3b kernel: [661127.882908] block drbd10: Restarting drbd10_receiver
  May 14 08:21:45 v3b kernel: [661127.882910] block drbd10: receiver (re)started
  May 14 08:21:45 v3b kernel: [661127.882913] block drbd10: conn( Unconnected -> WFConnection )
  May 14 08:21:46 v3b kernel: [661129.123506] block drbd10: Handshake successful: Agreed network protocol version 91
  May 14 08:21:46 v3b kernel: [661129.123511] block drbd10: conn( WFConnection -> WFReportParams )
  May 14 08:21:46 v3b kernel: [661129.123535] block drbd10: Starting asender thread (from drbd10_receiver [31418])
  May 14 08:21:46 v3b kernel: [661129.123876] block drbd10: data-integrity-alg: <not-used>
  May 14 08:21:46 v3b kernel: [661129.123898] block drbd10: drbd_sync_handshake:
  May 14 08:21:46 v3b kernel: [661129.123900] block drbd10: self C5DC68A8AFD5BFEC:0000000000000000:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
  May 14 08:21:46 v3b kernel: [661129.123903] block drbd10: peer F8BB238D22A7ACFF:C5DC68A8AFD5BFED:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
  May 14 08:21:46 v3b kernel: [661129.123905] block drbd10: uuid_compare()=-1 by rule 50
  May 14 08:21:46 v3b kernel: [661129.123908] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
  May 14 08:21:46 v3b kernel: [661129.138101] block drbd10: conn( WFBitMapT -> WFSyncUUID )
  May 14 08:21:46 v3b kernel: [661129.139563] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
  May 14 08:21:46 v3b kernel: [661129.140282] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
  May 14 08:21:46 v3b kernel: [661129.140286] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
  May 14 08:21:46 v3b kernel: [661129.140292] block drbd10: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
  May 14 08:21:47 v3b kernel: [661129.693954] block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
  May 14 08:21:47 v3b kernel: [661129.693961] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
  May 14 08:21:47 v3b kernel: [661129.693969] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
  May 14 08:21:47 v3b kernel: [661129.694725] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 0 (0x0)

I've not been able to correlate these ping drops and reconnections to any of:

1) interface capacity issues (a few times we might make a 400Mb spike, but sometimes there's none at all);
2) loss of connectivity or ARP problems on the two servers' dedicated DRBD interfaces (i.e. I've got an unbroken log of pings between the two servers);
3) any kernel grumbles about the network interface, bonding, RAID or anything remotely hardware-related.

Apart from the drbd messages there's no other chatter from the kernel. The customer's other two pairs of servers have been running for 18 months and have not exhibited this behaviour. The customer hasn't given me data showing that these blips (which last anything from 2s to 30s) correspond to any real performance problems, and I don't have access to the inside of their VMs to check for myself.

So my questions are: would you expect these disconnections to cause variations in I/O bandwidth or responsiveness? And secondly, what should I be doing about it?

My unsatisfactory response to the customer's worry has been to reconnect all the DRBDs with a longer ping-timeout (a sketch of that change follows at the end of this message), and in 10 hours it hasn't reoccurred, which is an unusually long record. I will be more convinced by the end of the day.

Even if that does solve these messages, I'm curious as to the cause. We've not hit a network bandwidth ceiling, and so we've definitely not hit an I/O ceiling (the storage is 4x 146GB 15kRPM discs in RAID10 on an HP RAID controller). I can accept that some VMs will use more bandwidth than others, so it wouldn't be surprising if one VM on the machine were the "cause". But when the disconnections happen, they appear to be completely random. Checking with grep/uniq -c, I see that out of the 11 devices on the systems it happens pretty randomly (and drbd10 is just a test, getting absolutely zero I/O):

   5 drbd0:
   5 drbd1:
  11 drbd2:
   8 drbd3:
  11 drbd4:
   4 drbd5:
   6 drbd6:
   7 drbd7:
   5 drbd8:
  14 drbd9:
  12 drbd10:
   7 drbd11:

So even if upping the ping time stops the problem, and even if the effects of the disconnect/reconnect cycles are harmless - why might DRBD exhibit these symptoms on one pair of servers, but not on the other two sets? Is there some I/O pattern that might cause pings to get lost, even over a lightly-loaded gigabit link?

Thanks for any insights in advance.

--
Matthew
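For reference, a minimal sketch of the longer-ping-timeout workaround mentioned above, assuming a DRBD 8.3-style configuration and a made-up value (ping-timeout is in tenths of a second, so 30 means 3 seconds; the default is 5, i.e. 500ms):

  net {
    ping-timeout 30;   # wait 3s instead of the default 500ms for PingAck
  }

applied with something like:

  drbdadm adjust all   # or an explicit drbdadm disconnect/connect per resource

The per-device counts above came from something along these lines (the kernel log path will depend on the syslog setup):

  grep 'PingAck did not arrive' /var/log/kern.log \
      | grep -o 'drbd[0-9]*:' | sort | uniq -c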