Lars Ellenberg wrote:
>> I just set up a drbd replication pair and shipped it to the colo. It was
>> working great.
>
> so you did local tests first, and all was working as expected?

Yes.

> now. your report is somewhat unspecific.
> anyways, when I read your last sentence "machine is hung",
> this might point to a deadlock that could occur when stressing the box.

Fair enough. I think the hang was that the block device was busy and
couldn't be unmounted, so when I tried to shut down the machine it
blocked waiting for the fs to unmount.

> this possible deadlock is due to a bio_alloc(,GFP_KERNEL) in drbd where
> is should have been GFP_NOIO, and has been recognized and fixed just
> after we released 0.7.17.
>
> may I ask you to try again with recent drbd svn?
> svn co http://svn.drbd.org/drbd/branches/drbd-0.7
> revision 2111 and greater should contain that fix.
> there may be a 0.7.18 bugfix release because of that.

I'll do that when I get the next set of test machines up. I downgraded
the kernel back to 2.6.11 for now on these boxes and everything works as
expected again.

> please report your findings.

I will when I have another test machine running 2.6.15 again, which
should be next week. If there's a known possible deadlock, I bet that's
what I ran into. On the other hand, is the default value for
on-disconnect reconnect or freeze_io? If it's freeze_io, I could see
that being what happened, too.

I'll keep trying to isolate it for you. I think we're going to be using
drbd a bit more as part of our Professional Services offerings to some
clients, so it'll be nice to know where the problem actually sits.

Thanks!

Monty Taylor
Senior Consultant, MySQL, Inc.