Can I remove that line??
><>
Nathan Stratton CTO, BlinkMind, Inc.
nathan at robotics.net nathan at blinkmind.com
http://www.robotics.net http://www.blinkmind.com
On Tue, 15 Jul 2008, Philipp Reisner wrote:
> On Monday, 14 July 2008 20:59:15, nathan at robotics.net wrote:
>> On Mon, 14 Jul 2008, Lee Christie wrote:
>>> During testing we knocked up a pair of servers with 5TB. Seemed to work
>>> just fine, taken at face value. However we didn't put much data on the
>>> volume so it is possible that it doesn't work with >4TB and we were just
>>> lucky.
>>
>> CentOS 5.2, kernel 2.6.18-92.1.6.el5xen, running DRBD 8.2.6
>>
>> Box with valid data comes up fine with:
>>
>> /sbin/service cman start
>> /sbin/service drbd start
>> /sbin/service clvmd start
>> /bin/mount -t gfs /dev/mirror/share /share
>>
>> When I try to bring up the 2nd primary, I get a kernel panic shortly
>> after drbd start comes up. Unfortunately, I am not able to get the full
>> crash on my screen and nothing shows up in any logs. Below is what I was
>> able to make out on the screen; it is typed in by hand, so...
>>
>> Call Trace:
>> [<ffffffff885efa11>] :drbd:w_make_resync_request+0x191/0x3f7
>> [<ffffffff885ef731>] :drbd:drbd_worker+0x2a3/0x3f2
>> [<ffffffff8028773e>] __wake_up_common+0x33/0x68
>> [<ffffffff8860532a>] :drbd:drbd_thread_setup+0xa2/0x18b
>> [<ffffffff80260b24>] child_rip+0xa/0x12
>> [<ffffffff88605288>] :drbd:drbd_thread_setup+0x0/0x18b
>> [<ffffffff80260b1a>] child_rip+0x0/0x12
>>
>> Code: 44 0f a3 20 19 d2 32 db 85 d2 0f 95 c3 eb 20 75 05 83 cb ff
>> RIP [<ffffffff885ece4e>] :drbd:drbd_bm_test_bit+0x8b/0xd1
>>
>
> Hi Nathan,
>
> Thanks for typing that Call trace... Here is an excerpt from drbd_bitmap.c
> with the line marked where the crash happened.
>
> int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
> {
>         unsigned long flags;
>         struct drbd_bitmap *b = mdev->bitmap;
>         int i;
>         ERR_IF(!b) return 0;
>         ERR_IF(!b->bm) return 0;
>
>         spin_lock_irqsave(&b->bm_lock, flags);
>         if (bitnr < b->bm_bits) {
>                 i = test_bit(bitnr, b->bm) ? 1 : 0; <=<<==<<<===<<<<====<<<<<===== HERE
>         } else if (bitnr == b->bm_bits) {
>                 i = -1;
>         } else { /* (bitnr > b->bm_bits) */
>                 ERR("bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
>                 i = 0;
>         }
>
>         spin_unlock_irqrestore(&b->bm_lock, flags);
>         return i;
> }
>
> You are right that we should print a nice error message saying that
> something went wrong with the allocation of the bitmap, instead of OOPSing
> in that case. As far as I know, we do that.
>
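For illustration only, here is a minimal user-space sketch of the pattern discussed above: report a failed bitmap allocation with a clear message, and refuse to test bits on a bitmap that was never allocated. It is not DRBD code. The `sketch_` names and the struct are hypothetical, `calloc` stands in for whatever the kernel module uses, and locking is left out. Only the return convention (1 or 0 for a valid bit, -1 at `bm_bits`, 0 on error) follows the excerpt.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for struct drbd_bitmap; not the DRBD definition. */
struct sketch_bitmap {
    unsigned long *bm;      /* bit storage, NULL if allocation failed */
    unsigned long bm_bits;  /* number of valid bits */
};

/* Allocate zeroed storage for nbits bits. Returns 0 on success, -1 on failure. */
int sketch_bm_alloc(struct sketch_bitmap *b, unsigned long nbits)
{
    size_t words = (nbits + SKETCH_BITS_PER_WORD - 1) / SKETCH_BITS_PER_WORD;

    b->bm = calloc(words ? words : 1, sizeof(unsigned long));
    if (!b->bm) {
        /* Fail loudly here instead of crashing later in a bit test. */
        fprintf(stderr, "bitmap allocation for %lu bits failed\n", nbits);
        b->bm_bits = 0;
        return -1;
    }
    b->bm_bits = nbits;
    return 0;
}

/* Set one bit; out-of-range requests are silently ignored in this sketch. */
void sketch_bm_set_bit(struct sketch_bitmap *b, unsigned long bitnr)
{
    if (b && b->bm && bitnr < b->bm_bits)
        b->bm[bitnr / SKETCH_BITS_PER_WORD] |=
            1UL << (bitnr % SKETCH_BITS_PER_WORD);
}

/* Returns 1 or 0 for a valid bit, -1 when bitnr == bm_bits, 0 on error. */
int sketch_bm_test_bit(const struct sketch_bitmap *b, unsigned long bitnr)
{
    if (!b || !b->bm)
        return 0;  /* no bitmap: never dereference it */

    if (bitnr < b->bm_bits) {
        unsigned long word = b->bm[bitnr / SKETCH_BITS_PER_WORD];
        return (int)((word >> (bitnr % SKETCH_BITS_PER_WORD)) & 1UL);
    }
    if (bitnr == b->bm_bits)
        return -1;

    fprintf(stderr, "bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
    return 0;
}
```

A caller would check the return value of `sketch_bm_alloc()` and stop there on failure, so that `sketch_bm_test_bit()` is never reached with storage smaller than `bm_bits` claims.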
> The question here is: why does it not abort with a failed bitmap allocation?
> Can you provide us with the kernel log from just before the crash?
> Was the resync already running for some time, or does it crash instantaneously?
> Is there any chance that you could also provide the upper part of the OOPS?
>
> Nathan, I do not want to create the impression that it will work for you if
> you help us to fix this. Probably it will then fail for you with a nice
> error message in the kernel log saying that the allocation of the bitmap
> failed...
>
> -Phil
> --
> : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 :
> : Vivenotgasse 48, 1120 Vienna, Austria http://www.linbit.com :
>
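To put rough numbers on the bitmap allocation mentioned above, here is a small sketch that estimates the bitmap size for a given device size. It assumes one bitmap bit per 4 KiB of storage, which is my understanding of how DRBD tracks out-of-sync blocks; the helper names are hypothetical and this is arithmetic only, not DRBD code.

```c
#include <stdint.h>

/* Assumption: one bitmap bit covers 4 KiB of the backing device. */
#define SKETCH_BM_BLOCK_SIZE 4096ULL

/* Number of bitmap bits needed to cover device_bytes, rounded up. */
uint64_t sketch_bm_bits(uint64_t device_bytes)
{
    return (device_bytes + SKETCH_BM_BLOCK_SIZE - 1) / SKETCH_BM_BLOCK_SIZE;
}

/* Number of bytes needed to store that many bits, rounded up. */
uint64_t sketch_bm_bytes(uint64_t device_bytes)
{
    return (sketch_bm_bits(device_bytes) + 7) / 8;
}
```

Under that assumption, a 4 TiB device needs 2^30 bits (128 MiB of bitmap) and a 5 TiB device needs 160 MiB, which would have to be allocated in kernel memory on each node.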