[DRBD-user] mkfs on a drbd partition hangs in drbd_al_begin_io

Håkan Engblom zyber_cynic at hotmail.com
Thu May 3 14:55:30 CEST 2007

Hi, See below.

br Håkan Engblom

>From: Lars Ellenberg <lars.ellenberg at linbit.com>
>To: drbd-user at lists.linbit.com
>Subject: Re: [DRBD-user] mkfs on a drbd partition hangs in drbd_al_begin_io
>Date: Thu, 3 May 2007 14:09:32 +0200
>On Thu, May 03, 2007 at 01:23:19PM +0200, Håkan Engblom wrote:
> > Hi,
> >
> > Some background: the drbd version is 0.7.22, running on a Montavista Linux
> > distribution, 2.6.10_mvl4.
> >
> > I've seen that sometimes, when doing mkfs on a drbd partition, the system
> > seems to hang in a drbd function in kernel space.
> > The problem has been reported once before to this mailing list, in
> > 2006, in a thread called "mkfs hangs with lastest drbd branch build and FC4
> > kernel" (I think it is the same problem), and it has also been observed by
> > others (seen when searching for "drbd_al_begin_io hangs" on Google).
> >
> > However, I've not seen any solution to the problem.
> >
> > So far I've been able to establish that the process seems to hang in
> > the drbd function mentioned above, and I also know that it hangs 640
> > bytes into the function. When looking at the source code of this function, my
> > guess is that it hangs on "spin_lock_irq(&mdev->al_lock);".
> >
> > Is this a known problem, and does anyone know of a solution?
>hanging in "spin_lock_irq" translates to a hard lockup of the machine.
>so, this is most likely not the correct guess.
It could be a faulty conclusion of course, but if the mkfs command never
returns to user space (strace gives no output at all) and every time I look
in /proc/<mkfs-PID>/wchan I can see that it is inside drbd_al_begin_io,
doesn't that indicate that it is hung inside that function? If it is hanging
in that function, what else could it be if it is not in spin_lock_irq,
especially since it is 640 bytes into the function, and that seems to be
close to the end of the function?
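
The check described above can be sketched as a small script. It is only a minimal sketch, assuming a Linux /proc that provides both /proc/<PID>/stat and /proc/<PID>/wchan; the helper names (parse_stat_state, describe, inspect_pid) are made up for illustration. As far as I understand it, wchan names the kernel function in which a task is sleeping, so a task that shows state D together with a wchan entry is waiting for something, not spinning on a lock.

```python
from pathlib import Path


def parse_stat_state(stat_text: str) -> str:
    """Return the one-letter task state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the last ')'.
    return stat_text.rsplit(")", 1)[1].split()[0]


def describe(state: str, wchan: str) -> str:
    """Turn a task state and a wchan symbol into a short description."""
    if state == "D":
        return f"uninterruptible sleep in {wchan}"
    if state == "S":
        return f"interruptible sleep in {wchan}"
    if state == "R":
        return "running or runnable (not sleeping)"
    return f"state {state}, wchan {wchan}"


def inspect_pid(pid: int) -> str:
    """Read state and wchan for one process (hypothetical helper)."""
    proc = Path("/proc") / str(pid)
    state = parse_stat_state((proc / "stat").read_text())
    wchan = (proc / "wchan").read_text().strip() or "0"
    return describe(state, wchan)
```

Calling inspect_pid(<mkfs-PID>) a few times, some seconds apart, would show whether the task stays in the same state and at the same wait point.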

>what exactly are the symptoms of that "hang"?
The symptom, seen from a high level, is that the mkfs never
finishes. Running strace on the process also shows that
nothing happens; it is stuck in the kernel.

>do the numbers in /proc/drbd move, still?
Don't know. I will check that the next time I see the problem.
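
One way to do that check is to take two snapshots of /proc/drbd a few seconds apart and compare the counters. The following is a minimal sketch, assuming that each device is introduced by a line like " 0: cs:..." and followed by two-letter counters in name:number form (ns, nr, dw, dr and so on); that layout and the function names are my assumptions, not taken from the drbd source.

```python
import re

# Two-letter counter followed by a number, e.g. "ns:1024".
COUNTER = re.compile(r"\b([a-z]{2}):(\d+)\b")
# Device header line, e.g. " 0: cs:SyncSource st:Primary/Secondary ...".
DEVICE = re.compile(r"\s*(\d+):\s+cs:")


def parse_counters(text: str) -> dict:
    """Map (device minor, counter name) to its integer value."""
    counters = {}
    device = None
    for line in text.splitlines():
        header = DEVICE.match(line)
        if header:
            device = int(header.group(1))
            continue
        if device is None:
            continue  # skip the version banner before the first device
        for name, value in COUNTER.findall(line):
            counters[(device, name)] = int(value)
    return counters


def changed_counters(before: str, after: str) -> dict:
    """Return the counters whose value differs between two snapshots."""
    old, new = parse_counters(before), parse_counters(after)
    return {key: (old[key], new[key])
            for key in old if key in new and old[key] != new[key]}
```

Reading /proc/drbd into a string, sleeping a few seconds, reading it again and passing both strings to changed_counters gives an empty dict if nothing moved.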

>can you reproduce this with some different kernel,
>preferably plain kernel.org?
Yes and no. Theoretically it would be possible, but I don't think my
project manager would give me the time to do that. In addition to this, the
problem is quite difficult to reproduce. It is seen sometimes when I do an
initial install of my system, including creating new partition tables and
formatting the drbd partitions. But it is far from every time that I see the
problem; it is seen in maybe 1 out of 10 reinstalls.
If I set up a limited test environment to try to reproduce the fault, my
experience in troubleshooting this kind of problem tells me that the
problem might not occur if the environment is scaled down. I could be wrong,
but it has happened several times before when I've had similar problems.

>does it hang only when "Connected" or also when "StandAlone"?
Don't know. So far it has always been seen immediately after drbd has been
started, and thus the state has been either SyncSource or possibly
PausedSyncS, so it has contact with the secondary node, but it is not fully
synchronised. In the system we have three drbd partitions, and they are
formatted sequentially, one after the other. The system can hang during
formatting of any of these three partitions.

>does running "while true; do sync; usleep 1; done" help?
>   when run on the Primary?
>                   Secondary?
>                   both?
Don't know. I can try it the next time I see the problem.

>is this on a software raid?
The partition we use below drbd (/dev/sda...) is on an ordinary SAS disk. We
use drbd to get server-redundancy. No additional software is used for 

>does it help doing this without software raid?


