Lars Ellenberg Lars.Ellenberg at linbit.com
Thu Mar 16 18:49:12 CET 2006

> I just set up a drbd replication pair and shipped it to the colo. It was working great.

so you did local tests first, and all was working as expected?

> Then we upgraded the kernel to 2.6.15-1.1833_FC4smp. I rebuilt the
> kernel modules on one machine, rebooted, connected and became primary.
> Then I rebooted the other machine into 2.6.15 so I could build its new
> modules. While I was waiting for that, I started transferring some
> data onto the drbd partition (fs already there, had been working
> great) by untarring. It had gone for a little bit when tar hung. I
> couldn't kill it. I figured it would timeout. It didn't. Also no
> errors in /var/log/messages. The other machine couldn't connect at
> this point, so I figured I had hosed something and that I would just
> start with what should be a pristine copy of the data on the other
> machine and make the bad machine resync.
> I power-cycled both machines, made the good machine primary, ran an
> fsck on the fs, mounted it. The bad machine connected and started to
> resync. It all looked as if it was fine. Yay! So then I realized that
> I had the network throttling down way too low (it's a 20G partition)
> and I needed to restart the drbd. So I disconnected the secondary. And
> unmounted the filesystem on the primary. At least I tried to. The
> umount failed (I had modified one file). And now the primary machine is
> hung again.

> So I've rebooted the secondary system back into 2.6.11, and I'm going
> to power cycle the primary again. Any ideas as to what's going on?

now, your report is somewhat unspecific.
anyway, your remark that the primary machine "is hung again"
might point to a deadlock that can occur when stressing the box.

this possible deadlock is due to a bio_alloc(,GFP_KERNEL) in drbd where
it should have been GFP_NOIO (a GFP_KERNEL allocation may start I/O to
free memory, and that I/O can end up waiting on drbd itself). it was
recognized and fixed just after we released 0.7.17.

may I ask you to try again with recent drbd svn? 
 svn co http://svn.drbd.org/drbd/branches/drbd-0.7
revision 2111 and greater should contain that fix.
there may be a 0.7.18 bugfix release because of that.
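
to check whether a checkout is recent enough, something like the
following rough sketch may help. only the 2111 threshold comes from this
mail; the helper names and the working-copy directory "drbd-0.7" are
made up for illustration, and it assumes `svn info` prints a
"Revision:" line (forced here with LC_ALL=C).

```shell
# Sketch only: FIX_REV is the revision named in this mail; the function
# names and the directory used in the example are assumptions.
FIX_REV=2111

# Print the revision of a Subversion working copy, parsed from `svn info`.
wc_revision() {
    LC_ALL=C svn info "$1" | awk '/^Revision:/ { print $2 }'
}

# Succeed if the given revision number is at or past the fix.
has_gfp_noio_fix() {
    [ "$1" -ge "$FIX_REV" ]
}

# Example use, after the checkout shown above:
#   rev=$(wc_revision drbd-0.7)
#   if has_gfp_noio_fix "$rev"; then
#       echo "r$rev should contain the fix"
#   else
#       echo "r$rev is too old, run 'svn update'"
#   fi
```

the comparison is kept separate from the `svn info` call so that it can
be used with a revision number obtained any other way.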

please report your findings.


