Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi, I've recently upgraded from 8.3 to 8.4 (together with the whole system, from debian wheezy to jessie), and since then I see some rather strange behaviors. I have 4 resources defined, and they all exhibit this issue. From time to time, I see this in the logs: 2015-11-23 15:27:55 warning kernel: [216645.905347] block drbd4: Remote failed to finish a request within ko-count * timeout This indicates that the secondary did not complete a write in 4 * 20 seconds. To further debug the issue I started running diskping and iostat and cat /proc/drbd on both nodes. The primary acted normally (iostat showed 100% util on the drbd device, indicating it's waiting for the secondary, and the drbd stats showed a lot of 'pe' requests). Looking at the stats from the secondary revealed some anomalies: up until the point the request got stuck everything was normal, but after that, I could see a request that's been sent to the local device (lo:1), but wasn't accounted for as coming from anywhere else (ua:0). The system didn't get out of this state without the reestablishment of the session. iostat on the secondary also didn't reveal anything anomalous, there were no stuck requests according to that source. I'm inclined to think it's software-related, since before the upgrade this exact same setup worked flawlessly... How can I debug this further? Current versions: version: 8.4.3 (api:1/proto:86-101) srcversion: 1A9F77B1CA5FF92235C2213 Linux store1 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u6 (2015-11-09) x86_64 GNU/Linux Outputs from the primary during the timeout: iostat: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util drbd3 0.00 0.00 1.00 0.00 4.00 0.00 8.00 29.00 0.00 0.00 0.00 1000.00 100.00 drbd: 3: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r----- ns:1371239 nr:0 dw:64963840 dr:400550864 al:729189 bm:39873 lo:0 pe:29 ua:0 ap:29 ep:1 wo:f oos:0 Secondary node (stats from the exact same time): iostat: Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdm 0.00 0.00 1.00 1242.00 4.00 1467.50 2.37 0.07 0.06 20.00 0.04 0.06 7.20 (there're more than 1 resources on this backing device, hence the usage) drbd: 3: cs:Connected ro:Secondary/Primary ds:UpToDate/UpToDate C r----- ns:0 nr:25596925 dw:53340613 dr:0 al:0 bm:24597 lo:1 pe:0 ua:0 ap:0 ep:1 wo:f oos:0 (I have several thousand samples per second, and they all show this exact data, so it's not some transient state) Thanks, Matyas