[DRBD-user] requests get stuck in secondary

Mon Nov 23 16:55:53 CET 2015

Hi,

I've recently upgraded from 8.3 to 8.4 (together with the whole system,
from debian wheezy to jessie), and since then I see some rather
strange behaviors. I have 4 resources defined, and they all exhibit this
issue. From time to time, I see this in the logs:

2015-11-23 15:27:55 warning kernel: [216645.905347] block drbd4: Remote failed to finish a request within ko-count * timeout

This indicates that the secondary did not complete a write in 4 * 20
seconds. To further debug the issue I started running diskping and iostat
and cat /proc/drbd on both nodes. The primary acted normally (iostat
showed 100% util on the drbd device, indicating it's waiting for the
secondary, and the drbd stats showed a lot of 'pe' requests).
Looking at the stats from the secondary revealed some anomalies: up until
the point the request got stuck everything was normal, but after that, I
could see a request that's been sent to the local device (lo:1), but
wasn't accounted for as coming from anywhere else (ua:0). The system
didn't get out of this state without the reestablishment of the session.
iostat on the secondary also didn't reveal anything anomalous, there were
no stuck requests according to that source.
I'm inclined to think it's software-related, since before the upgrade this
exact same setup worked flawlessly... How can I debug this further?

Current versions:

version: 8.4.3 (api:1/proto:86-101)
srcversion: 1A9F77B1CA5FF92235C2213

Linux store1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u6 (2015-11-09) x86_64 GNU/Linux

Outputs from the primary during the timeout:
iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
drbd3             0.00     0.00    1.00    0.00     4.00     0.00     8.00    29.00    0.00    0.00    0.00 1000.00 100.00

drbd:
 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
    ns:1371239 nr:0 dw:64963840 dr:400550864 al:729189 bm:39873 lo:0 pe:29 ua:0 ap:29 ep:1 wo:f oos:0

Secondary node (stats from the exact same time):
iostat:
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdm               0.00     0.00    1.00 1242.00     4.00  1467.50     2.37     0.07    0.06   20.00    0.04   0.06   7.20
(there're more than 1 resources on this backing device, hence the usage)

drbd:
 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r-----
    ns:0 nr:25596925 dw:53340613 dr:0 al:0 bm:24597 lo:1 pe:0 ua:0 ap:0 ep:1 wo:f oos:0
(I have several thousand samples per second, and they all show this exact
data, so it's not some transient state)

Thanks,
Matyas