Can I remove that line??

><>
Nathan Stratton                         CTO, BlinkMind, Inc.
nathan at robotics.net                  nathan at blinkmind.com
http://www.robotics.net                 http://www.blinkmind.com

On Tue, 15 Jul 2008, Philipp Reisner wrote:

> On Monday, 14 July 2008 20:59:15, nathan at robotics.net wrote:
>> On Mon, 14 Jul 2008, Lee Christie wrote:
>>> During testing we knocked up a pair of servers with 5TB. Seemed to work
>>> just fine, taken at face value. However we didn't put much data on the
>>> volume so it is possible that it doesn't work with >4TB and we were just
>>> lucky.
>>
>> Centos 5.2 kernel 2.6.18-92.1.6.el5xen running DRBD 8.2.6
>>
>> Box with valid data comes up fine with:
>>
>> /sbin/service cman start
>> /sbin/service drbd start
>> /sbin/service clvmd start
>> /bin/mount -t gfs /dev/mirror/share /share
>>
>> I try to bring up 2nd primary, shortly after drbd start comes up I get a
>> kernel panic, unfortunately, I am not able to get the full crash on my
>> screen and nothing shows up in any logs. Below is what I was able to make
>> out on the screen, it is typed in by hand so...
>>
>> Call Trace:
>> [<ffffffff885efa11>] :drbd:w_make_resync_request+0x191/0x3f7
>> [<ffffffff885ef731>] :drbd:drbd_worker+0x2a3/0x3f2
>> [<ffffffff8028773e>] __wake_up_common+0x33/0x68
>> [<ffffffff8860532a>] :drbd:drbd_thread_setup_0xa2/0x18b
>> [<ffffffff80260b24>] child_rip+0xa/0x12
>> [<ffffffff88605288>] :drbd:drbd_thread_setup_0x0/0x18b
>> [<ffffffff80260b1a>] child_rip+0x0/0x12
>>
>> Code: 44 0f a3 20 19 d2 32 db 85 d2 0f 95 c3 eb 20 75 05 83 cb ff
>> RIP [<ffffffff885ece4e>] :drbd:drbd_bm_test_bit+0x8b/0xd1
>>
>
> Hi Nathan,
>
> Thanks for typing that Call trace... Here is an excerpt from drbd_bitmap.c
> with the line marked where the crash happened.
>
> int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
> {
>         unsigned long flags;
>         struct drbd_bitmap *b = mdev->bitmap;
>         int i;
>         ERR_IF(!b) return 0;
>         ERR_IF(!b->bm) return 0;
>
>         spin_lock_irqsave(&b->bm_lock, flags);
>         if (bitnr < b->bm_bits) {
>                 i = test_bit(bitnr, b->bm) ? 1 : 0; <=<<==<<<===<<<<====<<<<<===== HERE
>         } else if (bitnr == b->bm_bits) {
>                 i = -1;
>         } else { /* (bitnr > b->bm_bits) */
>                 ERR("bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
>                 i = 0;
>         }
>
>         spin_unlock_irqrestore(&b->bm_lock, flags);
>         return i;
> }
>
> You are right that we should print a nice error message saying that
> something went wrong with the allocation of the bitmap instead of OOPSing
> in that case. As far as I know we do that.
>
> The question here is: why does it not abort with a failed bitmap allocation?
> Can you provide us the kernel log from just before the crash?
> Was the resync already running for some time, or does it crash instantaneously?
> Is there any chance that you could also provide the upper part of the OOPS?
>
> Nathan, I do not want to create the impression that it will work for you if
> you help us to fix this. Probably it will then fail for you with a nice
> error message in the kernel log saying that the allocation of the bitmap
> failed...
>
> -Phil
> --
> : Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
> : LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
> : Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
>
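
For readers who want to experiment with the control flow of the quoted
function outside the kernel, here is a minimal user-space sketch of the same
three-way check. It is not DRBD code: struct sketch_bitmap,
sketch_bm_test_bit and SKETCH_BITS_PER_LONG are hypothetical names made up
for this illustration, the spinlock and the ERR/ERR_IF logging are left out,
and a plain shift-and-mask stands in for the kernel's test_bit().

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct drbd_bitmap (assumption,
 * not the real layout). */
struct sketch_bitmap {
    unsigned long *bm;      /* bitmap words; NULL if the allocation failed */
    unsigned long bm_bits;  /* number of valid bits in the bitmap */
};

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Mirrors the return values of the quoted function:
 *   1 or 0  for a bit inside the bitmap,
 *   -1      when bitnr == bm_bits (one past the last valid bit),
 *   0       when the bitmap is missing or bitnr is out of range.
 */
int sketch_bm_test_bit(const struct sketch_bitmap *b, unsigned long bitnr)
{
    if (b == NULL || b->bm == NULL)
        return 0;   /* corresponds to the two ERR_IF guards */

    if (bitnr < b->bm_bits) {
        unsigned long word = b->bm[bitnr / SKETCH_BITS_PER_LONG];
        return (int)((word >> (bitnr % SKETCH_BITS_PER_LONG)) & 1UL);
    }
    if (bitnr == b->bm_bits)
        return -1;

    return 0;       /* bitnr > bm_bits; the kernel code logs an error here */
}
```

Note that a NULL check like this only catches an allocation that failed
cleanly. It cannot detect a pointer that is non-NULL but does not refer to a
buffer large enough for bm_bits bits, so the sketch says nothing about what
actually went wrong in the trace above.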