Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
I'm trying to understand a symptom for a client who uses DRBD to run sets of virtual machines between three pairs of servers (v1a/v1b, v2a/v2b, v3a/v3b), and I wanted to understand a bit better how DRBD I/O is buffered depending on which replication mode is chosen, and on the buffer settings.

Firstly, it surprised me that even in replication mode "A", the system still seemed limited by the bandwidth between nodes. I found this out when the customer's bonded interface had flipped over to its 100Mb backup connection and they suddenly had I/O problems. While I was investigating this and running tests, I noticed that switching to mode A didn't help, even when measuring short transfers that I'd expect to fit into reasonably sized buffers.

What kind of buffer size can I expect from an "auto-tuned" DRBD? It seems important to be able to cover bursts without leaning on the network, so I'd like to know whether that's possible with some special tuning.
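For concreteness, here is a minimal sketch of the knobs I mean, assuming a DRBD 8.3-style configuration (the resource name, devices and addresses below are invented for illustration, not taken from the real config):

  resource r0 {
    protocol A;             # asynchronous: a write completes once it is handed
                            # to the local TCP send buffer
    net {
      sndbuf-size 0;        # 0 is documented to mean "let the kernel auto-tune"
      # sndbuf-size 512k;   # an explicit value pins the send buffer - the kind
                            # of burst headroom I'm asking about
    }
    on v3a {
      device    /dev/drbd0;
      disk      /dev/vg0/lv_test;   # invented
      address   10.0.0.1:7788;      # invented
      meta-disk internal;
    }
    on v3b {
      device    /dev/drbd0;
      disk      /dev/vg0/lv_test;   # invented
      address   10.0.0.2:7788;      # invented
      meta-disk internal;
    }
  }

In other words: can the auto-tuned send buffer ever be large enough to absorb a short burst, or does covering bursts in protocol A always come down to setting sndbuf-size by hand?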
The other problem is the "PingAck not received" messages that have been littering the logs of the v3a/v3b servers for the last couple of weeks, e.g. this has been happening every few hours for one DRBD or another:

  May 14 08:21:45 v3b kernel: [661127.869500] block drbd10: PingAck did not arrive in time.
  May 14 08:21:45 v3b kernel: [661127.875553] block drbd10: peer( Primary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
  May 14 08:21:45 v3b kernel: [661127.875562] block drbd10: asender terminated
  May 14 08:21:45 v3b kernel: [661127.875564] block drbd10: Terminating drbd10_asender
  May 14 08:21:45 v3b kernel: [661127.875597] block drbd10: short read expecting header on sock: r=-512
  May 14 08:21:45 v3b kernel: [661127.882896] block drbd10: Connection closed
  May 14 08:21:45 v3b kernel: [661127.882899] block drbd10: conn( NetworkFailure -> Unconnected )
  May 14 08:21:45 v3b kernel: [661127.882904] block drbd10: receiver terminated
  May 14 08:21:45 v3b kernel: [661127.882908] block drbd10: Restarting drbd10_receiver
  May 14 08:21:45 v3b kernel: [661127.882910] block drbd10: receiver (re)started
  May 14 08:21:45 v3b kernel: [661127.882913] block drbd10: conn( Unconnected -> WFConnection )
  May 14 08:21:46 v3b kernel: [661129.123506] block drbd10: Handshake successful: Agreed network protocol version 91
  May 14 08:21:46 v3b kernel: [661129.123511] block drbd10: conn( WFConnection -> WFReportParams )
  May 14 08:21:46 v3b kernel: [661129.123535] block drbd10: Starting asender thread (from drbd10_receiver [31418])
  May 14 08:21:46 v3b kernel: [661129.123876] block drbd10: data-integrity-alg: <not-used>
  May 14 08:21:46 v3b kernel: [661129.123898] block drbd10: drbd_sync_handshake:
  May 14 08:21:46 v3b kernel: [661129.123900] block drbd10: self C5DC68A8AFD5BFEC:0000000000000000:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
  May 14 08:21:46 v3b kernel: [661129.123903] block drbd10: peer F8BB238D22A7ACFF:C5DC68A8AFD5BFED:7EB45F3A26B3BD72:2EC9659EFC4BC513 bits:0 flags:0
  May 14 08:21:46 v3b kernel: [661129.123905] block drbd10: uuid_compare()=-1 by rule 50
  May 14 08:21:46 v3b kernel: [661129.123908] block drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
  May 14 08:21:46 v3b kernel: [661129.138101] block drbd10: conn( WFBitMapT -> WFSyncUUID )
  May 14 08:21:46 v3b kernel: [661129.139563] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10
  May 14 08:21:46 v3b kernel: [661129.140282] block drbd10: helper command: /sbin/drbdadm before-resync-target minor-10 exit code 0 (0x0)
  May 14 08:21:46 v3b kernel: [661129.140286] block drbd10: conn( WFSyncUUID -> SyncTarget ) disk( UpToDate -> Inconsistent )
  May 14 08:21:46 v3b kernel: [661129.140292] block drbd10: Began resync as SyncTarget (will sync 0 KB [0 bits set]).
  May 14 08:21:47 v3b kernel: [661129.693954] block drbd10: Resync done (total 1 sec; paused 0 sec; 0 K/sec)
  May 14 08:21:47 v3b kernel: [661129.693961] block drbd10: conn( SyncTarget -> Connected ) disk( Inconsistent -> UpToDate )
  May 14 08:21:47 v3b kernel: [661129.693969] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10
  May 14 08:21:47 v3b kernel: [661129.694725] block drbd10: helper command: /sbin/drbdadm after-resync-target minor-10 exit code 0 (0x0)

I've not been able to correlate these ping drops and reconnections to any of:

1) interface capacity issues (a few times we might make a 400Mb spike, but sometimes there's none at all);
2) loss of connectivity or ARP problems on the two servers' dedicated DRBD interfaces (i.e. I've got an unbroken log of pings between the two servers);
3) any kernel grumbles about the network interface, bonding, RAID or anything remotely hardware-related.

Apart from the drbd messages there's no other chatter from the kernel. The customer's other two pairs of servers have been running for 18 months and have not exhibited this behaviour. The customer hasn't given me data showing that these blips (which last anything from 2s to 30s) correspond to any real performance problems, and I don't have access to the inside of their VMs to check for myself.

So my questions are: would you expect these disconnections to cause variations in I/O bandwidth or responsiveness? And secondly, what should I be doing about it?

My unsatisfactory response to the customer's worry has been to reconnect all the DRBDs with a longer ping-timeout (a sketch of that change follows at the end of this message), and in 10 hours it hasn't reoccurred, which is an unusually long record. I will be more convinced by the end of the day.

Even if that does solve these messages, I'm curious as to the cause. We've not hit a network bandwidth ceiling, and so we've definitely not hit an I/O ceiling (the storage is 4x 146GB 15kRPM discs in RAID10 on an HP RAID controller). I can accept that some VMs will use more bandwidth than others, so it wouldn't be surprising if one VM on the machine were the "cause". But when the disconnections happen, they appear to be completely random. Checking with grep/uniq -c, I see that out of the 11 devices on the systems it happens pretty randomly (and drbd10 is just a test, getting absolutely zero I/O):

   5 drbd0:
   5 drbd1:
  11 drbd2:
   8 drbd3:
  11 drbd4:
   4 drbd5:
   6 drbd6:
   7 drbd7:
   5 drbd8:
  14 drbd9:
  12 drbd10:
   7 drbd11:

So even if upping the ping time stops the problem, and even if the effects of the disconnect/reconnect cycles are harmless - why might DRBD exhibit these symptoms on one pair of servers, but not on the other two sets? Is there some I/O pattern that might cause pings to get lost, even over a lightly-loaded gigabit link?

Thanks for any insights in advance.

--
Matthew
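For reference, a minimal sketch of the longer-ping-timeout workaround mentioned above, assuming a DRBD 8.3-style configuration and a made-up value (ping-timeout is in tenths of a second, so 30 means 3 seconds; the default is 5, i.e. 500ms):

  net {
    ping-timeout 30;   # wait 3s instead of the default 500ms for PingAck
  }

applied with something like:

  drbdadm adjust all   # or an explicit drbdadm disconnect/connect per resource

The per-device counts above came from something along these lines (the kernel log path will depend on the syslog setup):

  grep 'PingAck did not arrive' /var/log/kern.log \
      | grep -o 'drbd[0-9]*:' | sort | uniq -c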