On Monday, 14 July 2008 20:59:15, nathan at robotics.net wrote:
> On Mon, 14 Jul 2008, Lee Christie wrote:
> > During testing we knocked up a pair of servers with 5TB. Seemed to work
> > just fine, taken at face value. However we didn't put much data on the
> > volume so it is possible that it doesn't work with >4TB and we were just
> > lucky.
>
> Centos 5.2 kernel 2.6.18-92.1.6.el5xen running DRBD 8.2.6
>
> Box with valid data comes up fine with:
>
> /sbin/service cman start
> /sbin/service drbd start
> /sbin/service clvmd start
> /bin/mount -t gfs /dev/mirror/share /share
>
> I try to bring up 2nd primary, shortly after drbd start comes up I get a
> kernel panic, unfortunately, I am not able to get the full crash on my
> screen and nothing shows up in any logs. Below is what I was able to make
> out on the screen, it is typed in by hand so...
>
> Call Trace:
> [<ffffffff885efa11>] :drbd:w_make_resync_request+0x191/0x3f7
> [<ffffffff885ef731>] :drbd:drbd_worker+0x2a3/0x3f2
> [<ffffffff8028773e>] __wake_up_common+0x33/0x68
> [<ffffffff8860532a>] :drbd:drbd_thread_setup_0xa2/0x18b
> [<ffffffff80260b24>] child_rip+0xa/0x12
> [<ffffffff88605288>] :drbd:drbd_thread_setup_0x0/0x18b
> [<ffffffff80260b1a>] child_rip+0x0/0x12
>
> Code: 44 0f a3 20 19 d2 32 db 85 d2 0f 95 c3 eb 20 75 05 83 cb ff
> RIP [<ffffffff885ece4e>] :drbd:drbd_bm_test_bit+0x8b/0xd1
>

Hi Nathan,

Thanks for typing in that call trace...

Here is an excerpt from drbd_bitmap.c, with the line marked where the
crash happened.

int drbd_bm_test_bit(struct drbd_conf *mdev, const unsigned long bitnr)
{
	unsigned long flags;
	struct drbd_bitmap *b = mdev->bitmap;
	int i;
	ERR_IF(!b) return 0;
	ERR_IF(!b->bm) return 0;

	spin_lock_irqsave(&b->bm_lock, flags);
	if (bitnr < b->bm_bits) {
		i = test_bit(bitnr, b->bm) ? 1 : 0;  <=<<==<<<===<<<<====<<<<<===== HERE
	} else if (bitnr == b->bm_bits) {
		i = -1;
	} else { /* (bitnr > b->bm_bits) */
		ERR("bitnr=%lu > bm_bits=%lu\n", bitnr, b->bm_bits);
		i = 0;
	}

	spin_unlock_irqrestore(&b->bm_lock, flags);
	return i;
}

You are right that we should print a nice error message saying that
something went wrong with the allocation of the bitmap, instead of
OOPSing in that case. As far as I know, we do that.

The question here is: why does it not abort with a failed bitmap
allocation?

Can you provide us with the kernel log from just before the crash?
Was the resync already running for some time, or does it crash
immediately?

Is there any chance that you could also provide the upper part of
the OOPS?

Nathan, I do not want to create the impression that it will work for you
if you help us to fix this. Most likely it will then fail for you with
a nice error message in the kernel log saying that the allocation of
the bitmap failed...

-Phil
-- 
: Dipl-Ing Philipp Reisner                      Tel +43-1-8178292-50 :
: LINBIT Information Technologies GmbH          Fax +43-1-8178292-82 :
: Vivenotgasse 48, 1120 Vienna, Austria        http://www.linbit.com :
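
The behaviour discussed in the message above (report a failed bitmap
allocation and refuse to test bits, instead of oopsing) can be sketched
in user-space C. This is not DRBD code: every sketch_* name is
hypothetical, calloc() stands in for the kernel allocator, and the
granularity of one bit per 4 KiB of storage is an assumption. Under that
assumption a 5 TiB device needs 1342177280 bits, which is a 160 MiB
bitmap, so a failed allocation is at least plausible at that size.

```c
#include <limits.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption for this sketch: one bitmap bit tracks 4 KiB of storage. */
#define SKETCH_BM_BLOCK_SIZE 4096ULL
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical user-space stand-in for struct drbd_bitmap. */
struct sketch_bitmap {
	unsigned long *bm;   /* bit array; NULL if the allocation failed */
	uint64_t bm_bits;    /* number of valid bits */
};

/* Number of bits needed to track a device of dev_bytes bytes. */
static uint64_t sketch_bits_for(uint64_t dev_bytes)
{
	return (dev_bytes + SKETCH_BM_BLOCK_SIZE - 1) / SKETCH_BM_BLOCK_SIZE;
}

/* Allocate the bitmap. Returns 0 on success; on failure prints a
 * message, leaves b->bm == NULL and returns -1. */
static int sketch_bm_alloc(struct sketch_bitmap *b, uint64_t dev_bytes)
{
	uint64_t bits = sketch_bits_for(dev_bytes);
	uint64_t words = (bits + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;

	b->bm = NULL;
	b->bm_bits = 0;
	if (bits == 0 || words > SIZE_MAX / sizeof(unsigned long)) {
		fprintf(stderr, "bitmap: cannot represent %llu bits\n",
			(unsigned long long)bits);
		return -1;
	}
	b->bm = calloc((size_t)words, sizeof(unsigned long));
	if (!b->bm) {
		fprintf(stderr, "bitmap: allocation of %llu words failed\n",
			(unsigned long long)words);
		return -1;
	}
	b->bm_bits = bits;
	return 0;
}

/* Set one bit; silently ignores an unallocated bitmap or a bad index. */
static void sketch_bm_set_bit(struct sketch_bitmap *b, uint64_t bitnr)
{
	if (!b || !b->bm || bitnr >= b->bm_bits)
		return;
	b->bm[bitnr / SKETCH_BITS_PER_LONG] |= 1UL << (bitnr % SKETCH_BITS_PER_LONG);
}

/* Same return convention as the quoted excerpt: 1 or 0 for a valid bit,
 * -1 for bitnr == bm_bits, 0 (with a message) when out of range, and 0
 * when the bitmap was never allocated. */
static int sketch_bm_test_bit(const struct sketch_bitmap *b, uint64_t bitnr)
{
	if (!b || !b->bm)
		return 0;
	if (bitnr < b->bm_bits)
		return (b->bm[bitnr / SKETCH_BITS_PER_LONG] >>
			(bitnr % SKETCH_BITS_PER_LONG)) & 1UL ? 1 : 0;
	if (bitnr == b->bm_bits)
		return -1;
	fprintf(stderr, "bitnr=%llu > bm_bits=%llu\n",
		(unsigned long long)bitnr, (unsigned long long)b->bm_bits);
	return 0;
}
```

A caller would check the return value of sketch_bm_alloc() and stop
before starting anything that walks the bitmap. The NULL checks in
sketch_bm_test_bit() are only a second line of defence, in the same
spirit as the ERR_IF() lines in the excerpt.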