Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hi,
We have a drbd setup with two resources. Both primary and secondary
systems have RAID6 devices, and are running
CentOS 6.5
2.6.32-431.20.3.el6.x86_64
drbd 8.3.16
The secondary is connected to the primary via a gigabit connection; we run
regular iperf tests and typically see speeds of 600-900 Mbit/s.
The secondary and primary are physically in different buildings but live on
the same VLAN.
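For reference, the iperf tests are essentially the following (the port,
duration, and hostname here are illustrative, not our exact invocation):

# on the secondary: start an iperf server
iperf -s -p 5001

# on the primary: run a 30-second throughput test against the secondary
iperf -c <secondary-address> -p 5001 -t 30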
We run BackupPC on the primary, and use drbd to replicate the BackupPC
archives to the remote site. During initial testing the machines were in the
same room, connected via a gigabit switch. Backups run against the primary
were replicated with no apparent problems.
Problems arose when we moved the secondary to the remote site. The backups
would run, but eventually some backup process would block, stuck in an
uninterruptible sleep (D) state.
Eventually these blocks would clear, but sometimes that took a long time
(>24 hours), even after we had stopped the backups altogether. (We were
unable to kill the blocked processes, but no new backups were started during
that time.)
We observed that the send-Q on the primary, and the recv-Q on the
secondary, had a LOT of data in them:
[djsperka@primary ~]$ netstat -tnp
Proto Recv-Q  Send-Q Local Address          Foreign Address        State        PID/Program name
--- edited output ---
tcp        0 1669560 111.222.333.444:59203  111.222.333.555:7790   ESTABLISHED  -
--- end edit ---
The other connections on the drbd ports all have empty send-q and recv-q.
Note the large send-q.
On the secondary, the situation was similar:
tcp  3773784       0 111.222.333.555:7790   111.222.333.444:59203  ESTABLISHED  -
Note the large recv-q.
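We caught this by polling netstat; something along these lines (a sketch,
7790 being the drbd port from our resource config) shows the queues growing
over time:

# print the drbd replication socket's queue sizes every 5 seconds
watch -n 5 "netstat -tn | grep ':7790'"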
The blocked processes on the primary are these:
[djsperka@primary ~]$ ps axl | awk '$10 ~ /^D/'
1     0  2261     2  20   0      0     0 down    D    ?     0:01 [xfsalloc/13]
1     0  6952     2  20   0      0     0 drbd_a  D    ?    18:41 [xfsbufd/drbd1]
1     0 32155     2  20   0      0     0 xfs_bm  D    ?     0:00 [flush-147:1]
And on the secondary the blocked process was this (sorry, I saved the
results of a slightly different command):
[djsperka@secondary ~]$ ps -eo pid,state,wchan:40,comm | grep -Ee " D |drbd"
2309 S down_interruptible drbd0_worker
2338 S down_interruptible drbd1_worker
2370 S sk_wait_data drbd0_receiver
2374 D blkdev_issue_flush drbd1_receiver
2487 S sk_wait_data drbd0_asender
2601 S sk_wait_data drbd1_asender
And now, my question(s):
1. Is it reasonable to assume that the large send-Q/recv-Q are leading to
the blocked processes?
2. If so, what drbd settings can we tweak to address the size of those
queues, and possibly to prevent the block(s) from happening in the first
place?
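For concreteness, the sort of net-section tuning we have been looking at is
sketched below. The option names are from the drbd.conf man page for 8.3, but
the values are guesses, and we don't know whether these are even the right
knobs for our situation:

# hypothetical tuning of the resource's net section (values are guesses)
resource r0 {
  net {
    sndbuf-size    512k;  # TCP send buffer (primary side)
    rcvbuf-size    512k;  # TCP receive buffer (secondary side)
    max-buffers    8000;  # receive buffers for incoming writes
    max-epoch-size 8000;  # write requests allowed per barrier/epoch
  }
}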
Thank you,
Dan
--
Daniel J. Sperka, Ph. D.
UC Davis Center for Neuroscience