[DRBD-user] Greater than 4TB

Tue Jul 15 22:18:27 CEST 2008

Can I remove that line??

><>
Nathan Stratton                                CTO, BlinkMind, Inc.
nathan at robotics.net                         nathan at blinkmind.com
http://www.robotics.net                        http://www.blinkmind.com

On Tue, 15 Jul 2008, Philipp Reisner wrote:

> On Monday, 14 July 2008 20:59:15, nathan at robotics.net wrote:
>> On Mon, 14 Jul 2008, Lee Christie wrote:
>>> During testing we knocked up a pair of servers with 5TB. Seemed to work
>>> just fine, taken at face value. However we didn't put much data on the
>>> volume so it is possible that it doesn't work with >4TB and we were just
>>> lucky.
>>
>> Centos 5.2 kernel 2.6.18-92.1.6.el5xen running DRBD 8.2.6
>>
>> Box with valid data comes up fine with:
>>
>> /sbin/service cman start
>> /sbin/service drbd start
>> /sbin/service clvmd start
>> /bin/mount -t gfs /dev/mirror/share /share
>>
>> When I try to bring up the 2nd primary, I get a kernel panic shortly after
>> drbd start comes up. Unfortunately, I am not able to get the full crash on
>> my screen and nothing shows up in any logs. Below is what I was able to make
>> out on the screen; it is typed in by hand, so...
>>
>> Call Trace:
>>   [<ffffffff885efa11>] :drbd:w_make_resync_request+0x191/0x3f7
>>   [<ffffffff885ef731>] :drbd:drbd_worker+0x2a3/0x3f2
>>   [<ffffffff8028773e>] __wake_up_common+0x33/0x68
>>   [<ffffffff8860532a>] :drbd:drbd_thread_setup+0xa2/0x18b
>>   [<ffffffff80260b24>] child_rip+0xa/0x12
>>   [<ffffffff88605288>] :drbd:drbd_thread_setup+0x0/0x18b
>>   [<ffffffff80260b1a>] child_rip+0x0/0x12
>>
>> Code: 44 0f a3 20 19 d2 32 db 85 d2 0f 95 c3 eb 20 75 05 83 cb ff
>> RIP [<ffffffff885ece4e>] :drbd:drbd_bm_test_bit+0x8b/0xd1
>>
>
> Hi Nathan,
>
> Thanks for typing that Call trace... Here is an excerpt from drbd_bitmap.c
> with the line marked where the crash happened.
>
> int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
> {
> 	unsigned long flags;
> 	struct drbd_bitmap *b = mdev->bitmap;
> 	int i;
> 	ERR_IF(!b) return 0;
> 	ERR_IF(!b->bm) return 0;
>
> 	spin_lock_irqsave(&b->bm_lock, flags);
> 	if (bitnr < b->bm_bits) {
> 		i = test_bit(bitnr, b->bm) ? 1 : 0; /* <=== the crash happened HERE */
> 	} else if (bitnr == b->bm_bits) {
> 		i = -1;
> 	} else { /* (bitnr > b->bm_bits) */
> 		ERR("bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
> 		i = 0;
> 	}
>
> 	spin_unlock_irqrestore(&b->bm_lock, flags);
> 	return i;
> }
>
> You are right that we should print a proper error message saying that the
> allocation of the bitmap went wrong, instead of OOPSing in that case. As
> far as I know we do that.
>
> The question here is: why does it not abort with a failed bitmap allocation?
> Can you provide us with the kernel log from just before the crash?
> Was the resync already running for some time, or does it crash immediately?
> Is there any chance that you could also provide the upper part of the OOPS?
>
> Nathan, I do not want to create the impression that it will work for you if
> you help us to fix this. Probably it will then fail for you with a nice
> error message in the kernel log saying that the allocation of the bitmap
> failed...
>
> -Phil
> -- 
> : Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
> : Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
>
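
For a sense of scale on the bitmap allocation discussed above, here is a
rough userspace sketch in C. It assumes one bitmap bit per 4 KiB of backing
storage; check that figure against your DRBD version. All sketch_* names are
hypothetical and are not DRBD's API. The test function only mirrors the
bounds checks of the quoted drbd_bm_test_bit() without the locking, and it
does not claim to explain the crash.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one bitmap bit covers 4 KiB of backing storage. */
#define SKETCH_BM_BLOCK_SIZE 4096ULL

/* Hypothetical stand-in for the bitmap structure; not DRBD's own type. */
struct sketch_bitmap {
	unsigned long *bm;	/* NULL if the allocation failed */
	uint64_t bm_bits;	/* number of valid bits in bm */
};

/* Bits needed to cover a device of dev_bytes, rounding up. */
static uint64_t sketch_bm_bits(uint64_t dev_bytes)
{
	return (dev_bytes + SKETCH_BM_BLOCK_SIZE - 1) / SKETCH_BM_BLOCK_SIZE;
}

/* Bytes of memory needed to hold that many bits, rounding up. */
static uint64_t sketch_bm_bytes(uint64_t dev_bytes)
{
	return (sketch_bm_bits(dev_bytes) + 7) / 8;
}

/*
 * Same return convention as the quoted function: 1 or 0 for a bit in
 * range, -1 when bitnr == bm_bits, 0 when there is no bitmap or bitnr
 * is past the end.
 */
static int sketch_bm_test_bit(const struct sketch_bitmap *b, uint64_t bitnr)
{
	const uint64_t bits_per_word = sizeof(unsigned long) * CHAR_BIT;

	if (!b || !b->bm)
		return 0;
	if (bitnr < b->bm_bits)
		return (int)((b->bm[bitnr / bits_per_word] >>
			      (bitnr % bits_per_word)) & 1UL);
	if (bitnr == b->bm_bits)
		return -1;
	return 0;
}
```

Under that assumption, sketch_bm_bytes(5ULL << 40) comes to 167772160 bytes
(160 MiB) for a 5 TiB device, and a 4 TiB device needs exactly 2^30 bits
(128 MiB).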