[DRBD-user] Oopses/timeouts on 0.7.5 with many big volumes

Renaud Guerin renaud.guerin at wanadooportails.com
Tue Nov 30 11:18:22 CET 2004

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

I'm trying to set up DRBD 0.7.5 on 2.6.10-rc1 with two 4GB RAM dual Xeon 
machines replicating ten ~500GB volumes.
My drbd.conf is mostly standard stuff, I left all the timeouts , 
protocol, startup, and disk directives as default.

I should probably mention that I had to patch the original 0.7.5 drbd 
because of a variable name change in 2.6.10-rc1 (can't remember which 
variable it was right now, but I doubt it has anything to do with my 
problem, it's a pretty trivial patch)

So I have 10 volumes on this machine going from vol21 to vol30.

Problem 1:
---------

When I start drbd for the first time I get something like this :
/etc/init.d/drbd start

Starting DRBD resources:    [vol21]Child process does not terminate!
Exiting.

If I start again:

Starting DRBD resources:    [vol21][vol22]Child process does not terminate!
Exiting.

and so on : one more volume is completed every time I launch the script. 
Looking at the source code, I believe the timeout has to do with the 
SLEEPS_LONG / SLEEPS_VERY_LONG values that probably aren't enough in my 
case (but i'm not sure if I'm hitting the "long" or "very long" timeout).
Can these be made into module parameters, or at least #define's ?

Eventually, after enough launches, all volumes are corrected 
initialized, except...

Problem 2:
---------

It goes on like this until [vol27], where the init script hangs, and I 
have this in the kernel logs:

drbd8: Creating state block
allocation failed: out of vmalloc space - use vmalloc=<size> to increase 
size.
drbd8: BM resizing failed. Leaving size unchanged at size = 488333312 KB
drbd8: Assuming that all blocks are out of sync (aka FullSync)
drbd8: drbd_bm_set_all: (!(b && b->bm)) in 
drivers/block/drbd/drbd_bitmap.c:521
[<f88ea0b9>] drbd_bm_set_all+0x15a/0x15f [drbd]
[<f88ec6b5>] drbd_ioctl_set_disk+0x625/0x739 [drbd]
[<c0240164>] __copy_to_user_ll+0x68/0x6c
[<f88edd9b>] drbd_ioctl+0x7d7/0xc04 [drbd]
[<f88ed5c4>] drbd_ioctl+0x0/0xc04 [drbd]
[<c026a650>] blkdev_ioctl+0x86/0x41b
[<c015bf06>] block_ioctl+0x0/0x15
[<c01659fb>] sys_ioctl+0x101/0x27a
[<c0105963>] syscall_call+0x7/0xb
Unable to handle kernel NULL pointer dereference at virtual address 
00000000
printing eip:
f88e9f9c
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: drbd tg3
CPU:    0
EIP:    0060:[<f88e9f9c>]    Not tainted VLI
EFLAGS: 00010046   (2.6.10-rc1)
EIP is at drbd_bm_set_all+0x3d/0x15f [drbd]
eax: 00000000   ebx: 00000000   ecx: c036b154   edx: 00000000
esi: f5fe6180   edi: f5bd2960   ebp: f6479c00   esp: f5a92e40
ds: 007b   es: 007b   ss: 0068
Process drbdsetup (pid: 18448, threadinfo=f5a92000 task=f63a5a20)
Stack: f8900540 00000008 f88fff2d f8900320 00000209 f6479c04 00000000 
f5fe6180
      f5bd2960 bffff220 f88ec6b5 f5bd2960 00000008 000001c9 f7eaac58 
c03b8380
      00000008 f5fe6b80 f76420bc f7f6e280 f65dc680 00000001 00000005 
00000000
Call Trace:
[<f88ec6b5>] drbd_ioctl_set_disk+0x625/0x739 [drbd]
[<c0240164>] __copy_to_user_ll+0x68/0x6c
[<f88edd9b>] drbd_ioctl+0x7d7/0xc04 [drbd]
[<f88ed5c4>] drbd_ioctl+0x0/0xc04 [drbd]
[<c026a650>] blkdev_ioctl+0x86/0x41b
[<c015bf06>] block_ioctl+0x0/0x15
[<c01659fb>] sys_ioctl+0x101/0x27a
[<c0105963>] syscall_call+0x7/0xb
Code: 00 85 ed 74 07 8b 4d 00 85 c9 75 05 b8 01 00 00 00 85 c0 0f 85 f1 
00 00 00 8d 45 04 89 44 24 14 e8 eb 68 a2 c7 8b 45 00 8b 55 14 <81>3c 90 
67 02 74 83 74 31 89 f8 c7 44 24 0c 0e 02 00 00 c7 44

As the kernel log hinted, I was able to work around this by booting the 
kernel with the vmalloc=256M option.
All volumes were then able to initialize (given enough /etc/init.d/drbd 
starts, see problem 1) with no error.

But maybe an oops is not the most appropriate behaviour for drbd to have 
in this case ?


Problem 3 (most serious)
---------

There's a new, more serious oops that happened on the target machine, a 
few minutes after I finally started synchronisation:

Nov 29 17:49:01 pers23fs kernel: drbd8: drbd8_receiver [12097]: cstate 
Connected --> WFBitMapT
Nov 29 17:49:01 pers23fs kernel: drbd8: Secondary/Secondary --> 
Secondary/Primary
Nov 29 17:49:01 pers23fs kernel: drbd9: I am(S): 
0:00000001:00000001:00000001:00000001:01
Nov 29 17:49:01 pers23fs kernel: drbd9: Peer(P): 
1:00000002:00000001:00000001:00000001:11
Nov 29 17:49:01 pers23fs kernel: drbd9: drbd9_receiver [12879]: cstate 
Connected --> WFBitMapT
Nov 29 17:49:01 pers23fs kernel: drbd9: Secondary/Secondary --> 
Secondary/Primary
Nov 29 17:49:26 pers23fs kernel: drbd1: drbd1_receiver [6773]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:26 pers23fs kernel: drbd1: Resync started as SyncTarget 
(need to sync 488325120 KB [122081280 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd0: drbd0_receiver [6471]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd0: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: ------------[ cut here ]------------
Nov 29 17:49:27 pers23fs kernel: kernel BUG at kernel/timer.c:267!
Nov 29 17:49:27 pers23fs kernel: invalid operand: 0000 [#1]
Nov 29 17:49:27 pers23fs kernel: SMP
Nov 29 17:49:27 pers23fs kernel: Modules linked in: drbd tg3
Nov 29 17:49:27 pers23fs kernel: CPU:    2
Nov 29 17:49:27 pers23fs kernel: EIP:    0060:[mod_timer+96/106]    Not 
tainted VLI
Nov 29 17:49:27 pers23fs kernel: EFLAGS: 00010246   (2.6.10-rc1)
Nov 29 17:49:27 pers23fs kernel: EIP is at mod_timer+0x60/0x6a
Nov 29 17:49:27 pers23fs kernel: eax: 00000000   ebx: ce57032c   ecx: 
c0369a88   edx: 00000282
Nov 29 17:49:27 pers23fs kernel: esi: 0026b60e   edi: ce570000   ebp: 
0000000e   esp: d6094f20
Nov 29 17:49:27 pers23fs kernel: ds: 007b   es: 007b   ss: 0068
Nov 29 17:49:27 pers23fs kernel: Process drbd0_receiver (pid: 6471, 
threadinfo=d6094000 task=eee58020)
Nov 29 17:49:27 pers23fs kernel: Stack: ce570000 c011b1ce 00000292 
ce5703c8 f0910978 ce57032c 0026b60e f0926d36
Nov 29 17:49:27 pers23fs kernel:        1d1b6000 0746d800 00000000 
00000000 003a36c0 ce570000 f09152dc ce570000
Nov 29 17:49:27 pers23fs kernel:        0000000e 000003e8 f0871000 
ce5701dc 00000001 f0871000 003a36c0 ce570000
Nov 29 17:49:27 pers23fs kernel: Call Trace:
Nov 29 17:49:27 pers23fs kernel:  [printk+23/27] printk+0x17/0x1b
Nov 29 17:49:27 pers23fs kernel:  [pg0+809961848/1068884992] 
drbd_start_resync+0x1f1/0x2a1 [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+809980636/1068884992] 
receive_bitmap+0x230/0x27d [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+809980076/1068884992] 
receive_bitmap+0x0/0x27d [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+809982014/1068884992] 
drbdd+0x59/0x13f [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+809984814/1068884992] 
drbdd_init+0x70/0x16e [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+810007830/1068884992] 
drbd_thread_setup+0x78/0xe0 [drbd]
Nov 29 17:49:27 pers23fs kernel:  [pg0+810007710/1068884992] 
drbd_thread_setup+0x0/0xe0 [drbd]
Nov 29 17:49:27 pers23fs kernel:  [kernel_thread_helper+5/11] 
kernel_thread_helper+0x5/0xb
Nov 29 17:49:27 pers23fs kernel: Code: 5c 24 14 8b 74 24 0c 8b 5c 24 08 
83 c4 10 e9 12 fe ff ff 8b 43 1c 85 c0 74 e1 b8 01 00 00 00 8b 5c 24
08 8b 74 24 0c 83 c4 10 c3 <0f> 0b 0b 01 75 3d 32 c0 eb b0 56 53 83 ec 
04 8b 74 24 10 81 7e
Nov 29 17:49:27 pers23fs kernel:  <6>drbd6: drbd6_receiver [10242]: 
cstate WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd6: Resync started as SyncTarget 
(need to sync 488325120 KB [122081280 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd6: drbd6_receiver [10242]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd6: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd8: drbd8_receiver [12097]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd8: Resync started as SyncTarget 
(need to sync 488325120 KB [122081280 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd8: drbd8_receiver [12097]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd8: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd3: drbd3_receiver [8077]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd3: Resync started as SyncTarget 
(need to sync 488325120 KB [122081280 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd3: drbd3_receiver [8077]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd3: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd9: drbd9_receiver [12879]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd9: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd9: drbd9_receiver [12879]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd9: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd2: drbd2_receiver [7154]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd2: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd2: drbd2_receiver [7154]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd2: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd3: Syncer continues.
Nov 29 17:49:27 pers23fs kernel: drbd3: drbd2_worker [7152]: cstate 
PausedSyncT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd8: Syncer continues.
Nov 29 17:49:27 pers23fs kernel: drbd8: drbd2_worker [7152]: cstate 
PausedSyncT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd5: drbd5_receiver [9607]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd5: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd1: drbd5_receiver [9607]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd1: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd3: drbd5_receiver [9607]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd3: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd8: drbd5_receiver [9607]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd8: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd4: drbd4_receiver [8581]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd4: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd4: drbd4_receiver [8581]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd4: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd2: Syncer continues.
Nov 29 17:49:27 pers23fs kernel: drbd2: drbd1_worker [6771]: cstate 
PausedSyncT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd7: drbd7_receiver [10916]: cstate 
WFBitMapT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd7: Resync started as SyncTarget 
(need to sync 488333312 KB [122083328 bits set]).
Nov 29 17:49:27 pers23fs kernel: drbd7: drbd7_receiver [10916]: cstate 
SyncTarget --> PausedSyncT
Nov 29 17:49:27 pers23fs kernel: drbd7: Syncer waits for sync group.
Nov 29 17:49:27 pers23fs kernel: drbd4: Syncer continues.
Nov 29 17:49:27 pers23fs kernel: drbd4: drbd3_worker [8075]: cstate 
PausedSyncT --> SyncTarget
Nov 29 17:49:27 pers23fs kernel: drbd9: Syncer continues.
Nov 29 17:49:27 pers23fs kernel: drbd9: drbd3_worker [8075]: cstate 
PausedSyncT --> SyncTarget

The machine was frozen after that.
Am I doing anything wrong or is this a genuine bug ?

Thanks for any help!




More information about the drbd-user mailing list