Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
We have a drbd setup with two resources. Both primary and secondary
systems have RAID6 devices, and are running
CentOS 6.5
2.6.32-431.20.3.el6.x86_64
drbd 8.3.16
The secondary is connected to the primary via a gigabit connection; we run
regular iperf tests and typically see speeds of 600-900 Mbit/s.
The secondary and primary are physically in different buildings but live on
the same VLAN.
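For reference, the iperf tests are essentially the following (the port,
duration, and hostname here are illustrative, not our exact invocation):

# on the secondary: start an iperf server
iperf -s -p 5001

# on the primary: run a 30-second throughput test against the secondary
iperf -c <secondary-address> -p 5001 -t 30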
We run BackupPC on the primary, and use drbd to replicate the BackupPC
archives to the remote site. During initial testing the machines were in the
same room, connected via a gigabit switch. Backups run against the primary
were replicated with no apparent problems.
Problems arose when we moved the secondary to the remote site. The backups
would run, but eventually some backup process would block, stuck in an
uninterruptible sleep (D) state.
Eventually these blocks would clear, but sometimes that took a long time
(>24 hours), even after we had stopped the backups altogether. (We were
unable to kill the blocked processes, but no new backups were started during
that time.)
We observed that the send-Q on the primary, and the recv-Q on the
secondary, had a LOT of data in them:
[djsperka@primary ~]$ netstat -tnp
Proto Recv-Q  Send-Q Local Address          Foreign Address        State        PID/Program name
--- edited output ---
tcp        0 1669560 111.222.333.444:59203  111.222.333.555:7790   ESTABLISHED  -
--- end edit ---
The other connections on the drbd ports all have empty send-q and recv-q.
Note the large send-q.
On the secondary, the situation was similar:
tcp  3773784       0 111.222.333.555:7790   111.222.333.444:59203  ESTABLISHED  -
Note the large recv-q.
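We caught this by polling netstat; something along these lines (a sketch,
7790 being the drbd port from our resource config) shows the queues growing
over time:

# print the drbd replication socket's queue sizes every 5 seconds
watch -n 5 "netstat -tn | grep ':7790'"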
The blocked processes on the primary are these:
[djsperka@primary ~]$ ps axl | awk '$10 ~ /^D/'
1     0  2261     2  20   0      0     0 down    D    ?     0:01 [xfsalloc/13]
1     0  6952     2  20   0      0     0 drbd_a  D    ?    18:41 [xfsbufd/drbd1]
1     0 32155     2  20   0      0     0 xfs_bm  D    ?     0:00 [flush-147:1]
And on the secondary the blocked process was this (sorry, I saved the
results of a slightly different command):
[djsperka@secondary ~]$ ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
2309 S down_interruptible drbd0_worker
2338 S down_interruptible drbd1_worker
2370 S sk_wait_data drbd0_receiver
2374 D blkdev_issue_flush drbd1_receiver
2487 S sk_wait_data drbd0_asender
2601 S sk_wait_data drbd1_asender
And now, my question(s):
1. Is it reasonable to assume that the large send-Q/recv-Q are leading to
the blocked processes?
2. If so, what drbd settings can we tweak to address the size of those
queues, and possibly to prevent the block(s) from happening in the first
place?
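For concreteness, the sort of net-section tuning we have been looking at is
sketched below. The option names are from the drbd.conf man page for 8.3, but
the values are guesses, and we don't know whether these are even the right
knobs for our situation:

# hypothetical tuning of the resource's net section (values are guesses)
resource r0 {
  net {
    sndbuf-size    512k;  # TCP send buffer (primary side)
    rcvbuf-size    512k;  # TCP receive buffer (secondary side)
    max-buffers    8000;  # receive buffers for incoming writes
    max-epoch-size 8000;  # write requests allowed per barrier/epoch
  }
}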
Thank you,
Dan
--
Daniel J. Sperka, Ph. D.
UC Davis Center for Neuroscience