[DRBD-user] 2.6.14-1.1656_FC4 kernel with drbd hangs (was: mkfs hangs)

Tue Jan 24 02:38:04 CET 2006

I have found a way to consistently cause the hang with 2.6.14-1.1656_FC4 &
drbd, and I believe I have enough results to say the fault is related to Neil
Brown's "reduce stack consumption" patch.  I say "related" because I don't know
enough to determine whether Neil's patch is 'rude & evil' :) or whether DRBD
just happens to exercise bios in a way that his email[2] indicates is "unsafe".

The question I still have now is:
Is this a problem that should be fixed in the kernel, i.e., Neil or others
revise and resubmit the patch, or is it some unsafe behaviour in DRBD which
must be fixed? (Finger pointing can now begin :)


setup:
The two machines have been configured, and on the test machine
`drbdadm -- --do-what-I-say primary test0`
was run; then the machines were allowed to sync.
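
One way to confirm the initial sync has finished before starting a test is to
look at /proc/drbd. The following is only a sketch: drbd_in_sync is a
hypothetical helper, and the status line it matches (cs:Connected together with
ld:Consistent) is my assumption about what drbd 0.7 prints, so check it against
the real output on your own machines first.

```shell
#!/bin/sh
# Hypothetical helper, not part of drbd: succeeds when the status line for
# the given minor number looks connected and consistent.
# ASSUMPTION: the drbd 0.7 status line looks like
#   " 0: cs:Connected st:Primary/Secondary ld:Consistent"
drbd_in_sync() {
    minor=$1
    status_file=${2:-/proc/drbd}   # second argument lets you test on a saved copy
    grep -E "^ *${minor}: " "$status_file" | grep -q 'cs:Connected.*ld:Consistent'
}

# Example: wait for /dev/drbd0 (minor 0) before running the test.
# until drbd_in_sync 0; do sleep 10; done
```

The optional file argument is there only so the check can be tried against a
saved copy of /proc/drbd instead of the live one.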

Method for each kernel under test:
Boot to the kernel you want to test, login to a virtual console (not X), then
as root run the attached lock_drbd_internal_Synced script.
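
The script itself is in the attachment and is not reproduced here. For anyone
who wants to notice the hang without sitting at the console, a rough sketch of a
watcher follows. watch_for_hang is a hypothetical helper of mine, not the
attached script, and the mkfs line in the example is simply the command from
Chip's original report. It only reports that the command is still running after
a time limit; it does not try to kill it, since a process stuck in the block
layer may not respond to signals anyway.

```shell
#!/bin/sh
# Hypothetical helper: run a command in the background and report whether it
# is still running after LIMIT seconds.
watch_for_hang() {
    limit=$1; shift
    "$@" &                      # start the command under test
    pid=$!
    waited=0
    # kill -0 sends no signal; it only checks that the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "HANG: pid $pid ('$*') still running after ${limit}s"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    wait "$pid"
    echo "DONE: '$*' exited with status $?"
}

# Example (this formats the device; run as root on a test box only):
# watch_for_hang 600 mkfs -j /dev/drbd0
```

With the ~3 minute run time seen on the unpatched kernels, a limit of 600
seconds leaves a comfortable margin for a 1GB partition.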

Results:
with 2.6.14-1.1656_FC4smp & 2.6.14-1.1656_FC4, the command locks up at
"Writing superblocks and filesystem accounting information:" and never returns.
      The system is still active though, as commands in other virtual
consoles still run. {Now hold the power or reset button until the machine
reboots.}

with 2.6.14-1.1656_FC4.tdennistsmp & 2.6.14-1.1656_FC4.tdennist (which are the
same as the 2.6.14-1.1656_FC4 ones but without Neil Brown's patch)[1], the
command completes in ~3 minutes with a 1GB partition.


software versions:
drbd-0.7.15 (installed for each of the kernels)
kernels:
	2.6.14-1.1656_FC4smp
	2.6.14-1.1656_FC4
	2.6.14-1.1656_FC4.tdennistsmp  [1]
	2.6.14-1.1656_FC4.tdennist

hardware:
CPUs: Intel(R) Xeon(TM) CPU 1.50GHz
IDE hard drive with a 1GB partition for DRBD.

drbd.conf:
resource test0 {
       protocol C;
       startup   {    wfc-timeout  0; }
       disk   {    on-io-error  panic;  }
       net
       {
         timeout      20;# unit: 0.1 seconds
         connect-int  10;# unit: seconds
         ping-int     10;# unit: seconds
         ko-count     30;
       }
       syncer {
         rate  30M;
         #group takes the place of sync-group
         group 1;
         al-extents 257;
       }
       on d-2
       {
         device       /dev/drbd0;
         disk         /dev/hda13;
         address      10.130.163.58:7788;
         #meta-disk    /dev/hda12[0];
         meta-disk    internal;
       }
       on d-5
       {
         device       /dev/drbd0;
         disk         /dev/hda11;
         address      10.130.163.61:7788;
         #meta-disk    /dev/hda10[0];
         meta-disk    internal;
       }

}



[1] the 2.6.14-1.1656_FC4.tdennist kernels were created by
rpm -ivh kernel-2.6.14-1.1656_FC4.src.rpm
cd /to/your_rpm_build_tree/area/
#Apply the following patch to SPECS/kernel-2.6.spec
###begin patch

--- kernel-2.6.spec.1656_FC4    2006-01-20 15:19:49.000000000 -0500
+++ kernel-2.6.spec     2006-01-20 15:25:43.000000000 -0500
@@ -804,3 +804,3 @@
      # Decrease stack usage in block layer
-%patch1790 -p1
+# %patch1790 -p1

###end patch
# then build the rpm
rpm -bb SPECS/kernel-2.6.spec
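
If you would rather not apply the spec patch by hand, the same one-line change
can be sketched with sed. disable_spec_patch is a hypothetical helper name, and
sed -i assumes GNU sed; the only thing taken from the patch above is the
"%patch1790 -p1" line being turned into "# %patch1790 -p1".

```shell
#!/bin/sh
# Hypothetical helper: comment out one "%patchNNNN ..." line in a spec file.
# Usage: disable_spec_patch SPECFILE PATCHNUMBER
disable_spec_patch() {
    spec=$1
    num=$2
    # Prefix the matching line with "# "; lines that are already commented
    # out or that belong to other patch numbers are left alone.
    sed -i "s/^\(%patch${num}[[:space:]].*\)\$/# \1/" "$spec"
}

# Example, run from the rpm build tree:
# disable_spec_patch SPECS/kernel-2.6.spec 1790
```

Running it a second time changes nothing, because the commented line no longer
starts with "%patch".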


[2] http://lkml.org/lkml/2005/11/6/169

Todd Denniston wrote:
> Chip Burke wrote:
> 
>> Good call. It is FC4 2.6.14_1656. Do I need to go back just one revision?
>> Or is there a specific kernel where this problem popped up?
>>
<SNIP>
> Looking at a diff of the two trees, out of the 42 files with changes, 
> the files I would rate most likely to be causing the problem are:
> drivers/block/ll_rw_blk.c
<SNIP>
> #The above change comes from a patch Neil Brown sent to 
> linux-kernel at vger.kernel.org
> "Mon, 7 Nov 2005 11:16:48"
>  Subject: do_mount: reduce stack consumption
> Signed-off-by: Neil Brown <neilb at cse.unsw.edu.au>
> Signed-off-by: Neil Brown <neilb at suse.de>
<SNIP>
>>
>> -----Original Message-----
>> From: Anquijix Schiptara [mailto:anquijix at hotmail.com]
>> Sent: Friday, January 20, 2006 10:34 AM
>> To: cburke at innova-partners.com
>> Subject: RE: [DRBD-user] mkfs hangs
>>
>> If you run FC4 with the newest kernel, install an older kernel version,
>> reinstall the drbd module, and all is good... There is already a similar
>> thread about the newest FC4 kernel.
>>
>>
>>> From: "Chip Burke" <cburke at innova-partners.com>
>>> Reply-To: cburke at innova-partners.com
>>> To: <drbd-user at linbit.com>
>>> Subject: [DRBD-user] mkfs hangs
>>> Date: Fri, 20 Jan 2006 10:13:44 -0500
>>>
>>> I am running 0.7.15 and I cannot seem to format a drive. When I go to run
>>> 'mkfs -j /dev/drbd0', mkfs hangs while writing the inode tables at the
>>> same place every time. If I stop drbd and format the underlying device,
>>> it works fine, but the drbd device is not a happy camper. Any ideas as to
>>> what the issue may be? /dev/drbd0 is set to primary and syncs just fine
>>> with its slave, so everything seems okay, but the drive isn't much good
>>> without a file system.







-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lock_drbd_internal_Synced
URL: <http://lists.linbit.com/pipermail/drbd-user/attachments/20060123/1e42499a/attachment.txt>