Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Jul 31, 2007 at 05:58:09PM +0200, Rainer Sabelka wrote: > Hi, > > I'm trying out drbd on two identical Linux boxes running Debian etch (standard > debian kernel 2.6.18-4-686) and drbd 8.0.4 (compiled from the sources in the > debian unstable repository) > > From time to time I get a kernel oops caused by drbd. I think it happened > during stopping drbd using "/etc/init.d/drbd stop" (altough I'm not 100% > surehere, because it always happened when executing some shell script which > contains multiple drbd related operations). > > Unfortunately I'm not able to reproduce this - in almost all cases starting > and stopping drbd just works. > > However, I thought I'll post the kernel traces of my last two drbd crashes to > this list, maybe someone has an idea what has been going wrong. > > -------------------------------------------------------------- > 1st incident: > > Jul 30 14:57:40 newserver2 kernel: drbd0: peer( Primary -> Unknown ) > conn( SyncTarget -> Disconnecting ) pdsk( UpToDate -> DUnknown ) > Jul 30 14:57:40 newserver2 kernel: drbd0: short read receiving data: read 720 > expected 4096 > Jul 30 14:57:40 newserver2 kernel: drbd0: error receiving RSDataReply, l: > 32792! > Jul 30 14:57:40 newserver2 kernel: drbd0: asender terminated > Jul 30 14:57:40 newserver2 kernel: drbd0: tl_clear() > Jul 30 14:57:40 newserver2 kernel: drbd0: Connection closed > Jul 30 14:57:40 newserver2 kernel: drbd0: Writing meta data super block now. > Jul 30 14:57:40 newserver2 kernel: drbd0: conn( Disconnecting -> StandAlone ) > Jul 30 14:57:40 newserver2 kernel: drbd0: receiver terminated > Jul 30 14:57:40 newserver2 kernel: drbd0: disk( Inconsistent -> Diskless ) > Jul 30 14:57:40 newserver2 kernel: drbd0: drbd_bm_resize called with capacity > == 0 > Jul 30 14:57:40 newserver2 kernel: drbd0: disk( Diskless -> Attaching ) > Jul 30 14:57:40 newserver2 kernel: drbd0: No usable activity log found. > Jul 30 14:57:40 newserver2 kernel: drbd0: max_segment_size ( = BIO size ) = > 32768 > Jul 30 14:57:40 newserver2 kernel: drbd0: drbd_bm_resize: > (down_trylock(&b->bm_change)) in /usr/src/modules/drbd/drbd/drbd_bitmap.c:370 > Jul 30 14:57:40 newserver2 kernel: drbd0: drbd_bm_resize called with capacity > == 1409178656 so you do have ~ 1.4 TB of storage. > Jul 30 14:57:40 newserver2 kernel: drbd0: worker terminated > Jul 30 14:57:40 newserver2 kernel: BUG: unable to handle kernel NULL pointer > dereference at virtual address 0000004c > Jul 30 14:57:40 newserver2 kernel: printing eip: > Jul 30 14:57:40 newserver2 kernel: f8c90f85 > Jul 30 14:57:40 newserver2 kernel: *pde = 00000000 > Jul 30 14:57:40 newserver2 kernel: Oops: 0000 [#1] > Jul 30 14:57:40 newserver2 kernel: SMP > Jul 30 14:57:40 newserver2 kernel: Modules linked in: drbd usbhid nfs nfsd > exportfs lockd nfs_acl sunrpc button ac battery ipv6 cn dm_snapshot dm_mirror > dm_mod loop serio_ > raw psmouse rtc shpchp pci_hotplug evdev pcspkr ext3 jbd mbcache ide_cd cdrom > piix sd_mod generic ehci_hcd ide_core uhci_hcd qla2xxx megaraid_sas bnx2 > usbcore firmware_cla > ss scsi_transport_fc scsi_mod thermal processor fan > Jul 30 14:57:40 newserver2 kernel: CPU: 7 > Jul 30 14:57:40 newserver2 kernel: EIP: 0060:[<f8c90f85>] Not tainted > VLI > Jul 30 14:57:40 newserver2 kernel: EFLAGS: 00010246 (2.6.18-4-686 #1) > Jul 30 14:57:40 newserver2 kernel: EIP is at drbd_bm_resize+0x146/0x37b [drbd] > Jul 30 14:57:40 newserver2 kernel: eax: 00000000 ebx: 00000000 ecx: > 0a7fcb84 edx: 00000000 > Jul 30 14:57:40 newserver2 kernel: esi: 00000000 edi: 00000000 ebp: > f7fc8d40 esp: df977eb4 > Jul 30 14:57:40 newserver2 kernel: ds: 007b es: 007b ss: 0068 > Jul 30 14:57:40 newserver2 kernel: Process cqueue/7 (pid: 2743, ti=df976000 > task=c3307aa0 task.ti=df976000) > Jul 30 14:57:40 newserver2 kernel: Stack: 53fe5c20 00000000 f6ec1400 53fe5c20 > 0a7fcb84 0053fe5e 00000000 53fe5c20 > Jul 30 14:57:40 newserver2 kernel: 00000000 00000000 00000000 f8ca7c3d > f6ec1400 0000a848 53fe5c20 00000000 > Jul 30 14:57:40 newserver2 kernel: 53fe5c20 00000000 c011d97e f8cb1ca9 > df977f14 df977f14 f8ca7697 f8cb1ca9 > Jul 30 14:57:40 newserver2 kernel: Call Trace: > Jul 30 14:57:40 newserver2 kernel: [<f8ca7c3d>] > drbd_determin_dev_size+0x14d/0x370 [drbd] > Jul 30 14:57:40 newserver2 kernel: [<c011d97e>] printk+0x14/0x18 > Jul 30 14:57:40 newserver2 kernel: [<f8ca7697>] > drbd_setup_queue_param+0x2bc/0x2f6 [drbd] > Jul 30 14:57:40 newserver2 kernel: [<f8c90d2f>] __drbd_bm_lock+0x18/0xda it may be that you see two problems: first, some race condition accessing the bitmap, second the debug aid that should help in finding the cause of the race condition dereferencing a NULL pointer in a debug printk... and that NULL pointer would be the mdev->bitmap, compare with drbd_bitmap.c ~ line 780 (drbd_bm_rw)... and the printk in the __drbd_bm_lock function... if this is actually the case, we need to track down and fix the cause of the race condition. as a work around for now, I'd recommend to add some "sleep 1" statements between your "multiple drbd related operations", and maybe even between the various drbdadm calls in the drbd init script. > [drbd] > Jul 30 14:57:40 newserver2 kernel: [<f8ca9359>] drbd_nl_disk_conf+0x524/0x788 > [drbd] > Jul 30 14:57:40 newserver2 kernel: [<f8ca8d86>] > drbd_connector_callback+0xc4/0x173 [drbd] ... > ------------------------------------------------------------------------------ > > 2nd incident: > > Jul 31 17:23:45 newserver1 kernel: drbd0: peer( Secondary -> Unknown ) > conn( SyncSource -> Disconnecting ) > Jul 31 17:23:45 newserver1 kernel: drbd0: drbd_pp_alloc interrupted! > Jul 31 17:23:45 newserver1 kernel: drbd0: alloc_ee: Allocation of a page > failed > Jul 31 17:23:45 newserver1 kernel: drbd0: error receiving RSDataRequest, l: > 24! > Jul 31 17:23:45 newserver1 kernel: drbd0: asender terminated > Jul 31 17:23:45 newserver1 kernel: drbd0: _drbd_send_page: size=4096 len=3492 > sent=-104 > Jul 31 17:23:45 newserver1 kernel: drbd0: drbd_send_block() failed > Jul 31 17:23:45 newserver1 kernel: drbd0: tl_clear() > Jul 31 17:23:45 newserver1 kernel: drbd0: Connection closed > Jul 31 17:23:45 newserver1 kernel: drbd0: Writing meta data super block now. > Jul 31 17:23:45 newserver1 kernel: drbd0: conn( Disconnecting -> StandAlone ) > Jul 31 17:23:45 newserver1 kernel: drbd0: receiver terminated > Jul 31 17:23:45 newserver1 kernel: drbd0: disk( UpToDate -> Diskless ) > pdsk( Inconsistent -> DUnknown ) > Jul 31 17:23:45 newserver1 kernel: drbd0: drbd_bm_resize called with capacity > == 0 > Jul 31 17:23:45 newserver1 kernel: drbd0: ASSERT( list_empty(&mdev->net_ee) ) > in /usr/src/modules/drbd/drbd/drbd_main.c:2103 > Jul 31 17:23:45 newserver1 kernel: drbd0: worker terminated > Jul 31 17:23:45 newserver1 kernel: ------------[ cut here ]------------ > Jul 31 17:23:45 newserver1 kernel: kernel BUG at include/linux/mm.h:300! this one should be fixed with svn revision 2964. please try again with current svn.drbd.org/drbd/branches/drbd-8.0/ -- : Lars Ellenberg Tel +43-1-8178292-55 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Vivenotgasse 48, A-1120 Vienna/Europe http://www.linbit.com :