On Sat, Mar 08, 2008 at 02:27:04PM -0800, Tom Brown wrote:
> Hardware: 2 Dell PowerEdge SC1435's
> Mirrored Drive: 1TB Hitachi HUA72101 SATA
> OS: Debian Etch 4.0r3
> Kernel: vanilla kernel 2.6.24.3
> DRBD: 8.2.5
> Heartbeat: 2.1.3
>
> I had drbd and heartbeat up and running. I did initial tests of the
> fail-over and mirroring. Everything worked as expected. Then I attached
> an external drive via firewire to a SIIG firewire card in the primary
> node. I mounted the external drive on /backup. The /dev/drbd0 device is
> mounted on /ha. Then I issued the following command at 17:50 and left
> for the night:
>
> tar cf /ha/fullbackup.tar /backup/ha
>
> The /backup/ha directory contains 334GB of data. When I came in to work
> this morning, I issued an 'ls -lh /ha' command and it hung. I checked
> syslog and found this:
>
> Mar 7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending)
> f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected

This is a write request, belonging to epoch 82821, 4kB in size,
to sector 621446208, locally completed ok, actually handed over
to the TCP stack, but still waiting on the "Ack" from the remote node.

The barrier ack which closes this epoch came in before the Ack,
which also explains why the barrier-acked set_size is one less
than expected.

This "should not happen".
It still does not explain why it hangs after that...

> Mar 7 20:03:02 fs01 kernel: drbd0: ASSERT( b->n_req == set_size )
> in /usr/src/drbd-8.2.5/drbd/drbd_main.c:238
> Mar 7 20:03:02 fs01 kernel: drbd0: b->n_req = 592
> in /usr/src/drbd-8.2.5/drbd/drbd_main.c:246
> Mar 7 20:03:02 fs01 kernel: drbd0: set_size = 591
> in /usr/src/drbd-8.2.5/drbd/drbd_main.c:247
>
> Any access to /ha hangs. The tar command is hung. I found a post from
> January 10, 2005 that the second line in the log is nothing to worry
> about. I looked in drbd_main.c and didn't see anything that indicated a
> major problem. It looks like it just reports the sizes of b->n_req and
> set_size when they are not equal.
>
> What does it mean when b->n_req != set_size? Is this an indicator of why
> the drbd0 device is not accessible anymore? Or is the first line from
> the log above (where it says FIXME) an indicator of a bigger problem?
>
> I had to restart the primary. I could access the drbd0 device after it
> came back up. I found that the last write to the tar file was at 20:02.
> That's about the same time those errors showed up in the log.
>
> Well I found some posts about a Broadcom NetXtreme II BCM5708 NIC with
> TOE causing drbd lockups. I am using an onboard Broadcom NetXtreme
> BCM5721 NIC without TOE. One of the posts said to try this:
>
> ethtool -K ethX tx off
> ethtool -K ethX rx off
>
> Which I did and tried it again. This time it worked. Did I fix the
> problem, or just get lucky? Any ideas?

There may be some race within DRBD.
There may just be data corruption (bit flips) on the wire.
It is also possible that there is some incompatibility between
DRBD's expectations for some things and how they are actually done
in kernel 2.6.24.

Can you give me a "cat /proc/drbd" please?
(Preferably from the time when it "hung".)

Well, you could also just tell me the git-hash of the DRBD "8.2.5"
and the drbd protocol you have been using, and I'll try to "imagine"
possible problems while meditating over pieces of source code...

-- 
: commercial DRBD/HA support and consulting: sales at linbit.com :
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
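
For anyone who wants to pull the same numbers out of their own syslog, here is a minimal sketch that extracts the epoch, sector and size from the FIXME line quoted above. It is only an illustration: the field names (`req`, `rw`, `flags`, and so on) are labels made up for this example, not DRBD identifiers, the flag string is deliberately left uninterpreted, and the byte offset assumes that sectors are counted in 512-byte units.

```python
import re

# The FIXME message from the report, with its two physical lines joined.
FIXME_LINE = (
    "Mar 7 20:03:02 fs01 kernel: drbd0: FIXME (barrier_acked but pending) "
    "f6af0688 W L-coNp-s-- 82821 (621446208s +4096) Connected"
)

# Hypothetical field names, chosen for readability only.
PATTERN = re.compile(
    r"\b(?P<req>[0-9a-f]+)\s+(?P<rw>[RW])\s+(?P<flags>\S+)\s+"
    r"(?P<epoch>\d+)\s+\((?P<sector>\d+)s\s+\+(?P<size>\d+)\)"
)

SECTOR_BYTES = 512  # assumption: sector numbers are in 512-byte units


def parse_request(line):
    """Return the numeric fields of a request dump, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    sector = int(m.group("sector"))
    return {
        "rw": m.group("rw"),            # "W" for the write request above
        "flags": m.group("flags"),      # kept verbatim, not decoded here
        "epoch": int(m.group("epoch")),
        "sector": sector,
        "size_bytes": int(m.group("size")),
        "byte_offset": sector * SECTOR_BYTES,
    }


if __name__ == "__main__":
    print(parse_request(FIXME_LINE))
```

The same function returns None for lines that do not carry a request dump, so it can be run over a whole syslog file and the results filtered.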