On Sun, 2008-03-09 at 14:35 +0100, Lars Ellenberg wrote:
> On Sat, Mar 08, 2008 at 02:27:04PM -0800, Tom Brown wrote:
> > Hardware: 2 Dell PowerEdge SC1435's
> > Mirrored Drive: 1TB Hitachi HUA72101 SATA
> > OS: Debian Etch 4.0r3
> > Kernel: vanilla kernel 2.6.24.3
> > DRBD: 8.2.5
> > Heartbeat: 2.1.3
> >
> > I had drbd and heartbeat up and running. I did initial tests of the
> > fail-over and mirroring. Everything worked as expected. Then I attached
> > an external drive via firewire to a SIIG firewire card in the primary
> > node. I mounted the external drive on /backup. The /dev/drbd0 device is
> > mounted on /ha. Then I issued the following command at 17:50 and left
> > for the night:
> >
> > tar cf /ha/fullbackup.tar /backup/ha
> >
> > The /backup/ha directory contains 334GB of data. When I came in to work
> > this morning, I issued an 'ls -lh /ha' command and it hung. I checked
> > syslog and found this:
> >
> > Mar 7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending)
> > f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected
>
> a write request, belonging to epoch 82821,
> 4kB in size, to sector 621446208,
> locally completed ok,
> actually handed over to the tcp stack,
> but still waiting on the "Ack" from the remote node.
>
> the barrier ack which closes this epoch
> came in before the Ack, which also explains
> why the barrier ack'ed set_size is one less than expected.
>
> This "should not happen".
> it still does not explain why it hangs after that...
>
> > Mar 7 20:03:02 fs01 kernel: drbd0: ASSERT( b->n_req == set_size )
> > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:238
> > Mar 7 20:03:02 fs01 kernel: drbd0: b->n_req = 592
> > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:246
> > Mar 7 20:03:02 fs01 kernel: drbd0: set_size = 591
> > in /usr/src/drbd-8.2.5/drbd/drbd_main.c:247
> >
> > Any access to /ha hangs. The tar command is hung. I found a post from
> > January 10, 2005 saying that the second line in the log is nothing to
> > worry about. I looked in drbd_main.c and didn't see anything that
> > indicated a major problem. It looks like it just reports the sizes of
> > b->n_req and set_size when they are not equal.
> >
> > What does it mean when b->n_req != set_size? Is this an indicator of why
> > the drbd0 device is not accessible anymore? Or is the first line from
> > the log above (where it says FIXME) an indicator of a bigger problem?
> >
> > I had to restart the primary. I could access the drbd0 device after it
> > came back up. I found that the last write to the tar file was at 20:02.
> > That's about the same time those errors showed up in the log.
> >
> > Well I found some posts about a Broadcom NetXtreme II BCM5708 NIC with
> > TOE causing drbd lockups. I am using an onboard Broadcom NetXtreme
> > BCM5721 NIC without TOE. One of the posts said to try this:
> >
> > ethtool -K ethX tx off
> > ethtool -K ethX rx off
> >
> > Which I did and tried it again. This time it worked. Did I fix the
> > problem, or just get lucky? Any ideas?
>
> There may be some race within DRBD.
> There may just be data corruption (bit flips) on the wire.
> it is also possible that there is some incompatibility between
> DRBD's expectations for some things and how it is actually done
> in kernel 2.6.24.
>
> Can you give me a "cat /proc/drbd" please?
> (preferably from the time when it "hung".)
>
> well, you could also just tell me the git-hash of the DRBD "8.2.5"
> and the drbd protocol you have been using, and I'll try to "imagine"
> possible problems while meditating over pieces of source code...
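
For anyone who wants to check the numbers in the FIXME line quoted above, here is a rough Python sketch that pulls the epoch, sector and size out of the request descriptor and converts the sector to a byte offset on the device. The function name `decode_request` and the regular expression are illustrative, not part of DRBD, and the 512-byte sector size is an assumption (the usual kernel sector unit), not something stated in the log.

```python
import re

# Request descriptor copied from the syslog line quoted above.
LOG_LINE = "f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected"

# Assumption: the "s" suffix means 512-byte kernel sectors.
SECTOR_SIZE = 512

# Hypothetical pattern for "<epoch> (<sector>s +<size>)"; not a DRBD API.
REQ_RE = re.compile(r"(?P<epoch>\d+) \((?P<sector>\d+)s \+(?P<size>\d+)\)")


def decode_request(line):
    """Return epoch, start sector, size and byte offset of the request."""
    m = REQ_RE.search(line)
    if m is None:
        raise ValueError("no request descriptor found in line")
    sector = int(m["sector"])
    return {
        "epoch": int(m["epoch"]),
        "sector": sector,
        "size_bytes": int(m["size"]),
        "byte_offset": sector * SECTOR_SIZE,
    }


req = decode_request(LOG_LINE)
print(req)
print("offset in GiB: %.1f" % (req["byte_offset"] / 2**30))
```

Under that assumption the stuck write sits roughly 296 GiB into the device, which is at least consistent with a 334GB tar archive that stopped growing partway through.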
Here's the 'cat /proc/drbd' while it was hung:

version: 8.2.5 (api:88/proto:86-88)
GIT-hash: 9faf052fdae5ef0c61b4d03890e2d2eab550610c build by tbrown at fs01, 2008-03-06 08:12:19
 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
    ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 lo:0 pe:1 ua:0 ap:1
        resync: used:0/31 hits:5 misses:1 starving:0 dirty:0 changed:1
        act_log: used:1/257 hits:79483563 misses:83224 starving:0 dirty:3076 changed:80148

I've been able to write another ~300GB to the drbd0 device without a problem.

Thanks,
Tom
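
In case it helps others reading /proc/drbd from a hung node, below is a minimal Python sketch that extracts the two-letter counters from the status output above and reports which of the "in flight" ones are non-zero. The helper names are made up for illustration. The field meanings in the comments (lo = local I/O in flight, pe = sent to the peer and not yet answered, ua = received from the peer and not yet acknowledged, ap = application requests not yet completed) follow my reading of the DRBD 8.x documentation and should be treated as an assumption. A non-zero value is normal under load; it only looks suspicious when it stays put while the device is idle or hung.

```python
import re

# Device lines copied from the /proc/drbd output above (DRBD 8.2.5 layout).
SAMPLE = (
    " 0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---\n"
    "    ns:318254844 nr:160 dw:318255004 dr:9762 al:80148 bm:1 "
    "lo:0 pe:1 ua:0 ap:1\n"
)

# Two-letter key followed by an integer, e.g. "pe:1".
COUNTER_RE = re.compile(r"\b([a-z]{2}):(\d+)\b")

# Counters that track requests still in flight (assumed meanings, see above).
IN_FLIGHT = ("lo", "pe", "ua", "ap")


def parse_counters(text):
    """Map each two-letter numeric counter to its integer value."""
    return {key: int(value) for key, value in COUNTER_RE.findall(text)}


def stuck_indicators(counters):
    """Return the in-flight counters that are currently non-zero."""
    return {k: counters[k] for k in IN_FLIGHT if counters.get(k, 0) > 0}


counters = parse_counters(SAMPLE)
print(stuck_indicators(counters))  # prints {'pe': 1, 'ap': 1}
```

For the output above this leaves pe:1 and ap:1, i.e. one request still waiting on the peer and one application request not yet completed, which matches the single pending write described in the FIXME line.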