[DRBD-user] Greater than 4TB

Tue Jul 15 14:19:03 CEST 2008

On Monday, 14 July 2008 20:59:15, nathan at robotics.net wrote:
> On Mon, 14 Jul 2008, Lee Christie wrote:
> > During testing we knocked up a pair of servers with 5TB. Seemed to work
> > just fine, taken at face value. However we didn't put much data on the
> > volume so it is possible that it doesn't work with >4TB and we were just
> > lucky.
>
> Centos 5.2 kernel 2.6.18-92.1.6.el5xen running DRBD 8.2.6
>
> Box with valid data comes up fine with:
>
> /sbin/service cman start
> /sbin/service drbd start
> /sbin/service clvmd start
> /bin/mount -t gfs /dev/mirror/share /share
>
> I try to bring up 2nd primary, shortly after drbd start comes up I get a
> kernel panic, unfortunately, I am not able to get the full crash on my
> screen and nothing shows up in any logs. Below is what I was able to make
> out on the screen, it is typed in by hand so...
>
> Call Trace:
>   [<ffffffff885efa11>] :drbd:w_make_resync_request+0x191/0x3f7
>   [<ffffffff885ef731>] :drbd:drbd_worker+0x2a3/0x3f2
>   [<ffffffff8028773e>] __wake_up_common+0x33/0x68
>   [<ffffffff8860532a>] :drbd:drbd_thread_setup_0xa2/0x18b
>   [<ffffffff80260b24>] child_rip+0xa/0x12
>   [<ffffffff88605288>] :drbd:drbd_thread_setup_0x0/0x18b
>   [<ffffffff80260b1a>] child_rip+0x0/0x12
>
> Code: 44 0f a3 20 19 d2 32 db 85 d2 0f 95 c3 eb 20 75 05 83 cb ff
> RIP [<ffffffff885ece4e>] :drbd:drbd_bm_test_bit+0x8b/0xd1
>

Hi Nathan,

Thanks for typing in that call trace. Here is an excerpt from drbd_bitmap.c,
with the line where the crash happened marked.

int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
{
	unsigned long flags;
	struct drbd_bitmap *b = mdev->bitmap;
	int i;
	ERR_IF(!b) return 0;
	ERR_IF(!b->bm) return 0;

	spin_lock_irqsave(&b->bm_lock, flags);
	if (bitnr < b->bm_bits) {
		i = test_bit(bitnr, b->bm) ? 1 : 0; /* <=== HERE: the crash happened on this line */
	} else if (bitnr == b->bm_bits) {
		i = -1;
	} else { /* (bitnr > b->bm_bits) */
		ERR("bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
		i = 0;
	}

	spin_unlock_irqrestore(&b->bm_lock, flags);
	return i;
}
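
To make the point of the excerpt easier to follow, here is a minimal
user-space sketch of the same branch structure. It is not DRBD code:
struct bitmap_model, its bm_words field, model_test_bit() and the -2 return
value are all hypothetical names introduced only for illustration. The sketch
shows that the bounds check compares bitnr against bm_bits only, so it cannot
protect against a buffer that is shorter than bm_bits claims; the extra guard
on bm_words is an assumption about what such a protection could look like.

```c
#include <assert.h>   /* only needed for the usage checks described below */
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for struct drbd_bitmap, reduced to what is used here. */
struct bitmap_model {
	unsigned long *bm;      /* bitmap words */
	unsigned long bm_bits;  /* number of bits the bitmap claims to hold */
	size_t bm_words;        /* words actually allocated (assumed field) */
};

/*
 * Returns 1 or 0 for a bit inside the bitmap, -1 for bitnr == bm_bits,
 * 0 for bitnr > bm_bits (same as the excerpt), and -2 when the claimed
 * size and the allocated size disagree (not part of the excerpt).
 */
int model_test_bit(const struct bitmap_model *b, unsigned long bitnr)
{
	if (!b || !b->bm)
		return 0;
	if (bitnr == b->bm_bits)
		return -1;
	if (bitnr > b->bm_bits)
		return 0;
	/* Extra guard, an assumption of this sketch: never read past the buffer. */
	if (bitnr / BITS_PER_WORD >= b->bm_words)
		return -2;
	return ((b->bm[bitnr / BITS_PER_WORD] >> (bitnr % BITS_PER_WORD)) & 1UL) ? 1 : 0;
}
```

With a two-word buffer whose first word is 0x5, bits 0 and 2 read as 1 and
bit 1 reads as 0. If bm_bits claims four words while only two are allocated,
a lookup in the third or fourth word returns -2 in this model instead of
touching memory outside the buffer.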

You are right that we should print a clear error message saying that the
allocation of the bitmap failed, instead of oopsing in that case. As far as
I know, we already do that.

The question here is: why does it not abort with a failed bitmap allocation?
Can you provide us with the kernel log from just before the crash?
Had the resync already been running for some time, or does it crash immediately?
Is there any chance that you could also provide the upper part of the oops?

Nathan, I do not want to give the impression that it will work for you once
you have helped us fix this. Most likely it will then still fail for you, but
with a clear error message in the kernel log saying that the allocation of
the bitmap failed...
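
As a rough illustration of why the bitmap allocation grows with the device
size, the arithmetic can be sketched as follows. The sketch assumes one
bitmap bit per 4 KiB of storage; SKETCH_BLOCK_SIZE, bm_bits_for() and
bm_bytes_for() are hypothetical names, not taken from the DRBD source, and
the numbers say nothing about why the allocation would fail on a given
kernel.

```c
#include <assert.h>   /* only needed for the usage checks described below */
#include <stdint.h>

/* ASSUMPTION of this sketch: one bitmap bit tracks 4 KiB of storage. */
#define SKETCH_BLOCK_SIZE 4096ULL

/* Number of bitmap bits needed to cover dev_bytes, rounded up. */
uint64_t bm_bits_for(uint64_t dev_bytes)
{
	return (dev_bytes + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;
}

/* Number of bytes needed to store that many bits, rounded up. */
uint64_t bm_bytes_for(uint64_t dev_bytes)
{
	return (bm_bits_for(dev_bytes) + 7) / 8;
}
```

Under that assumption, a 4 TiB device needs 2^30 bits, that is 128 MiB of
bitmap, and a 5 TiB device needs 160 MiB.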

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :