Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,

We have a DRBD setup with two resources. Both the primary and secondary systems have RAID6 devices and are running CentOS 6.5 (kernel 2.6.32-431.20.3.el6.x86_64) with DRBD 8.3.16. The secondary is connected to the primary via a gigabit link; we run regular iperf tests and typically see 600-900 Mbits/sec. The two machines are physically in different buildings but live on the same VLAN.

We run BackupPC on the primary and use DRBD to replicate the BackupPC archives to the remote site. During initial testing the machines were in the same room, connected via a gigabit switch, and backups run against the primary were replicated with no apparent problems.

Problems arose when we moved the secondary to the remote site. The backups would run, but eventually some backup process would block - stuck in uninterruptible sleep (D state). Eventually these blocks would clear, but sometimes it took a long time (>24 hours), even after we had stopped the backups altogether. (We were unable to kill the blocked processes, but no new backups were started during that time.)

We observed that the Send-Q on the primary and the Recv-Q on the secondary had a LOT of data in them:

[djsperka@primary ~]$ netstat -tnp
Proto Recv-Q  Send-Q  Local Address          Foreign Address        State        PID/Program name
--- edited output ---
tcp        0 1669560  111.222.333.444:59203  111.222.333.555:7790   ESTABLISHED  -
--- end edit ---

The other connections on the DRBD ports all have empty Send-Q and Recv-Q; note the large Send-Q above. On the secondary the situation was similar:

tcp  3773784       0  111.222.333.555:7790   111.222.333.444:59203  ESTABLISHED  -

Note the large Recv-Q.

The blocked processes on the primary are these:

[djsperka@primary ~]$ ps axl | awk '$10 ~ /^D/'
1  0   2261  2  20  0  0  0  down    D  ?  0:01   [xfsalloc/13]
1  0   6952  2  20  0  0  0  drbd_a  D  ?  18:41  [xfsbufd/drbd1]
1  0  32155  2  20  0  0  0  xfs_bm  D  ?  0:00   [flush-147:1]

And on the secondary the blocked process was (sorry, I saved the results of a slightly different command):

[djsperka@secondary ~]$ ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
 2309  S  down_interruptible   drbd0_worker
 2338  S  down_interruptible   drbd1_worker
 2370  S  sk_wait_data         drbd0_receiver
 2374  D  blkdev_issue_flush   drbd1_receiver
 2487  S  sk_wait_data         drbd0_asender
 2601  S  sk_wait_data         drbd1_asender

And now, my questions:

1. Is it reasonable to assume that the large Send-Q/Recv-Q are leading to the blocked processes?

2. If so, which DRBD settings can we tweak to address the size of those queues, and possibly to prevent the blocks from happening in the first place?

Thank you,
Dan

--
Daniel J. Sperka, Ph.D.
UC Davis Center for Neuroscience
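P.S. In case it helps frame question 1: here is a minimal sketch of how we have been sampling the queues, using port 7790 from the output above. The log path is just a placeholder we picked:

#!/bin/sh
# Sample the Recv-Q/Send-Q of the DRBD connections (port 7790 in our
# setup) once a minute, so queue growth can be correlated with the
# times the backup processes block.
while true; do
    date
    netstat -tn | awk '$4 ~ /:7790$/ || $5 ~ /:7790$/'
    sleep 60
done >> /var/log/drbd-queue-watch.log    # placeholder path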
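P.P.S. For question 2, the knobs we have found in the 8.3 documentation are the net-section buffer and congestion settings. Below is a sketch of what we are considering - the resource name and the numbers are placeholders, not tested values, and as we understand it pull-ahead only applies with protocol A:

resource r0 {                # "r0" stands in for our real resource
  protocol A;                # asynchronous; we currently run protocol C
  net {
    sndbuf-size 2M;          # TCP send buffer (0 would mean auto-tune)
    # Congestion policy (DRBD 8.3.9+): once the send buffer fills past
    # congestion-fill, go Ahead/Behind instead of blocking writers.
    on-congestion      pull-ahead;
    congestion-fill    1M;   # placeholder; should stay below sndbuf-size
    congestion-extents 127;  # placeholder; should not exceed al-extents
  }
  syncer {
    rate 30M;                # throttle resync so it does not swamp the link
  }
}

Would something along these lines keep the writers on the primary from blocking when the link or the secondary's disks fall behind, or are we looking at the wrong settings?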