[DRBD-user] mkfs on a drbd partition hangs in drbd_al_begin_io

Lars Ellenberg lars.ellenberg at linbit.com
Thu May 3 15:19:52 CEST 2007


On Thu, May 03, 2007 at 02:55:30PM +0200, Håkan Engblom wrote:
> Hi, See below.
> 
> 
> br Håkan Engblom
> 
> 
> >From: Lars Ellenberg <lars.ellenberg at linbit.com>
> >To: drbd-user at lists.linbit.com
> >Subject: Re: [DRBD-user] mkfs on a drbd partition hangs in drbd_al_begin_io
> >Date: Thu, 3 May 2007 14:09:32 +0200
> >
> >On Thu, May 03, 2007 at 01:23:19PM +0200, Håkan Engblom wrote:
> >> Hi,
> >>
> >> Some background: drbd-version is 0.7.22, running on a Montavista Linux
> >> distribution 2.6.10_mvl4
> >>
> >> I've seen that sometimes when doing mkfs on a drbd-partition, the system
> >> seems to hang in a drbd-function in kernel-space.
> >> The problem has been reported once before to this mailing-list, in
> >> February 2006, in a thread called "mkfs hangs with lastest drbd branch
> >> build and FC4 kernel" (I think it is the same problem), and it has also
> >> been observed by others (seen when searching for "drbd_al_begin_io hangs"
> >> in Google).
> >>
> >> However, I've not seen any solution to the problem.
> >>
> >> So far what I've been able to establish is that the process seems to hang
> >> in the drbd-function mentioned above, and I also know that it hangs 640
> >> bytes into the function. When looking at the source code of this function,
> >> my guess is that it hangs on "spin_lock_irq(&mdev->al_lock);".
> >>
> >> Is this a known problem and does anyone know of a solution?
> >
> >hanging in "spin_lock_irq" translates to a hard lockup of the machine.
> >so, this is most likely not the correct guess.
> It could be a faulty conclusion of course, but if the mkfs-command never
> returns to user-space (strace gives no output at all) and every time I look
> in /proc/<mkfs-PID>/wchan I can see that it is inside drbd_al_begin_io,
> isn't that indicating that it is hung inside that function? If it is hanging
> in that function, what else could it be if it is not in spin_lock_irq,
> especially since it is 640 bytes into the function, and that seems to be
> close to the end of the function?

well, if you can still access the box, and even strace things and stuff,
then it is certainly not hanging in a spin_lock_irq :)
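
One way to tell a sleeping task from a spinning one is to read its state
next to its wchan. The sketch below is only an illustration and not part of
drbd: the function names are made up here, and the classification is a rough
heuristic. It relies on two things Linux provides, the one-letter state as
the third field of /proc/<pid>/stat and the sleep location in
/proc/<pid>/wchan. A task in state D with a wchan entry is asleep waiting
for something; a task spinning on a lock would be running instead.

```python
import sys

def task_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')', so split after the LAST closing parenthesis.
    rest = stat_line[stat_line.rfind(")") + 1:]
    return rest.split()[0]

def classify(state, wchan):
    """Rough reading of state plus wchan; a heuristic, not a proof."""
    if state == "D":
        return "sleeping uninterruptibly in %s (waiting, not spinning)" % wchan
    if state == "R":
        return "runnable or running (a spinning task would look like this)"
    return "state %s, wchan %s" % (state, wchan)

def inspect(pid):
    """Read state and wchan for one pid from /proc (Linux only)."""
    with open("/proc/%d/stat" % pid) as f:
        state = task_state(f.read())
    with open("/proc/%d/wchan" % pid) as f:
        wchan = f.read().strip() or "0"
    return classify(state, wchan)

if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_task.py <mkfs-PID>
    print(inspect(int(sys.argv[1])))
```

Run it a few times against the mkfs PID; if the state stays D and the wchan
stays drbd_al_begin_io, the process is waiting on something, not spinning.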

drbd_al_begin_io (sometimes, not always)
needs to do a drbd meta data transaction.
meaning it writes 512 bytes to the drbd meta data area,
and only returns once this write is completed.
if that write is never completed, well, it never returns.
so apparently, for some reason, in your setup the lower level
storage drivers sometimes do not complete this write in a timely manner.
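
The "submit a write, return only once it has completed" pattern can be
sketched in user space as follows. This is a toy model, not drbd code: the
class, the two "drivers" and the timeout parameter are all invented for
illustration. It shows why a completion that never arrives leaves the caller
blocked forever when no timeout is given.

```python
import threading

class MetaDataWrite:
    """Stand-in for one meta data write (hypothetical)."""
    def __init__(self):
        self.done = threading.Event()

    def complete(self):
        # What the lower level driver does once the write has finished.
        self.done.set()

def al_begin_io_sketch(submit, timeout=None):
    """Submit a write and block until it completes.

    Returns True on completion, False only if a timeout was given and
    expired. With timeout=None this never returns if the completion
    is lost.
    """
    write = MetaDataWrite()
    submit(write)
    return write.done.wait(timeout)

def good_driver(write):
    # Completes the write shortly afterwards from another thread.
    threading.Timer(0.01, write.complete).start()

def lossy_driver(write):
    # Accepts the write but never signals completion.
    pass
```

Calling al_begin_io_sketch(good_driver) returns almost at once, while
al_begin_io_sketch(lossy_driver) with no timeout would block indefinitely,
which is the kind of hang described above.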

> >what exactly are the symptoms of that "hang"?
> The symptom, looking at it from a high level, is that the mkfs never 
> finishes. When doing strace on the process, it is also possible to see that 
> nothing happens, it is stuck in the kernel.

> >do the numbers in /proc/drbd move, still?
> Don't know. I will check that the next time I see the problem.
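
A small helper can make "do the numbers move" easy to answer: take two
copies of /proc/drbd a few seconds apart and compare the counters. The
sketch below is an assumption-laden illustration. It only expects each
device section to start with the minor number followed by a colon, and the
counters to appear as two-letter names followed by a colon and a number; the
function names are made up here.

```python
import re

DEVICE = re.compile(r"^\s*(\d+):")
COUNTER = re.compile(r"\b([a-z]{2}):(\d+)\b")

def parse_counters(text):
    """Map device minor -> {counter name: value}."""
    devices, current = {}, None
    for line in text.splitlines():
        m = DEVICE.match(line)
        if m:
            current = devices.setdefault(int(m.group(1)), {})
        if current is not None:
            # Non-numeric fields such as the connection state are skipped.
            for name, value in COUNTER.findall(line):
                current[name] = int(value)
    return devices

def moved(before, after):
    """Return {minor: {name: delta}} for every counter that changed."""
    old_all = parse_counters(before)
    changes = {}
    for minor, new in parse_counters(after).items():
        old = old_all.get(minor, {})
        delta = {k: v - old.get(k, 0)
                 for k, v in new.items() if v != old.get(k, 0)}
        if delta:
            changes[minor] = delta
    return changes
```

Read /proc/drbd twice while mkfs appears stuck and pass both texts to
moved(); an empty result means none of the counters changed in between.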
> 
> >
> >can you reproduce this with some different kernel,
> >preferably plain kernel.org?
> Yes and no. Theoretically it would be possible, but I don't think I would 
> get the time to do that from my project-manager. In addition to this, the 
> problem is quite difficult to reproduce. It is seen sometimes when I do an
> initial install of my system, including creating new partition-tables, and
> formatting the drbd-partitions. But it is far from every time I see the
> problem; it is seen maybe 1 in 10 times when I do a reinstall.
> If I set up a limited test-environment to try to reproduce the fault, my
> experience in troubleshooting this kind of problem tells me that the
> problem might not occur if the environment is scaled down. I could be
> wrong, but it has happened several times before when I've had similar
> problems.

really, this is something in your setup.
some misbehaving lower level storage driver would be my guess.
maybe it would be possible to do a workaround within drbd.
but it is not drbd's fault.

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


