[DRBD-user] drbd sync killing machine with OOM

Rene Mayrhofer rene.mayrhofer at gibraltar.at
Thu May 3 20:52:00 CEST 2007



Hi all,

[Resent because my mail from yesterday seemingly has not yet been approved by 
the moderators.]

I am new to drbd8 (on Debian etch) and am using it to build HA Xen domU 
instances across two servers. For various reasons (mostly to rule out 
split-brain at all times), I use roughly the following layout:

/dev/mdX --> LVM2 volumes --> drbd volume for each LV --> XEN domU

So for each LV, only one of the nodes promotes the drbd device to primary and 
runs the domU on it (and yes, I plan to write a big HOWTO document once this 
has run for a few months and we have enough experience with live migration 
of Xen instances etc.).

The critical problem, which has taken our servers offline a few times in the 
past 2 days, is that when too many of the drbd volumes are syncing at the 
same time, both nodes fall into OOM hell and reboot after some time, 
repeating the cycle all over again.

Syslog from one of these occasions (after calling /etc/init.d/drbd reload on 
the secondary node, where the drbd volumes had been manually disconnected 
beforehand):

May  2 13:05:12 jupiter kernel: drbd0: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd0: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd0: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd0: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd0: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd0: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd0: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd0: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd14: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd14: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd14: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd14: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd14: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd14: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd14: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd14: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd16: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd16: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd16: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd16: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd16: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd16: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd18: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd18: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd18: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd16: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd18: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd18: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd18: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd18: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May  2 13:05:12 jupiter kernel: drbd6: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd6: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd6: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd6: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd6: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd6: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd6: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd8: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd8: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd8: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd8: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd8: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd8: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd20: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd20: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd8: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May  2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd20: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd20: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd20: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd2: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd2: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd2: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd20: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd2: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd2: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd2: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd2: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd4: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd4: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd4: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd4: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd4: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd4: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd22: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd4: peer( Unknown -> Secondary ) conn( 
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May  2 13:05:12 jupiter kernel: drbd22: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd22: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd22: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd22: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd22: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd22: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd24: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd24: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd24: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd24: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd24: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd24: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd24: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd10: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd10: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd10: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd10: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd10: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd10: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd10: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd12: conn( StandAlone -> Unconnected )
May  2 13:05:12 jupiter kernel: drbd12: receiver (re)started
May  2 13:05:12 jupiter kernel: drbd12: conn( Unconnected -> WFConnection )
May  2 13:05:12 jupiter kernel: drbd12: conn( WFConnection -> WFReportParams )
May  2 13:05:12 jupiter kernel: drbd12: Handshake successful: DRBD Network 
Protocol version 86
May  2 13:05:12 jupiter kernel: drbd12: Peer authenticated using 32 bytes 
of 'sha256' HMAC
May  2 13:05:12 jupiter kernel: drbd12: peer( Unknown -> Primary ) conn( 
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May  2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd8: conn( WFBitMapS -> SyncSource )
May  2 13:05:12 jupiter kernel: drbd8: Began resync as SyncSource (will sync 
241612 KB [60403 bits set]).
May  2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd10: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd10: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd10: Began resync as SyncTarget (will sync 
736764 KB [184191 bits set]).
May  2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd12: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd24: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd24: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd24: Began resync as SyncTarget (will sync 
285660 KB [71415 bits set]).
May  2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd2: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd2: Began resync as SyncTarget (will sync 
307228 KB [76807 bits set]).
May  2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd18: conn( WFBitMapS -> SyncSource )
May  2 13:05:12 jupiter kernel: drbd18: Began resync as SyncSource (will sync 
159696 KB [39924 bits set]).
May  2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd20: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd12: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd12: Began resync as SyncTarget (will sync 
558272 KB [139568 bits set]).
May  2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd20: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd20: Began resync as SyncTarget (will sync 
114688 KB [28672 bits set]).
May  2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd4: conn( WFBitMapS -> SyncSource )
May  2 13:05:12 jupiter kernel: drbd4: Began resync as SyncSource (will sync 
143312 KB [35828 bits set]).
May  2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd22: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd22: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd22: Began resync as SyncTarget (will sync 
148532 KB [37133 bits set]).
May  2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd6: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd6: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd6: Began resync as SyncTarget (will sync 
1058128 KB [264532 bits set]).
May  2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May  2 13:05:12 jupiter kernel: drbd16: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:12 jupiter kernel: drbd16: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:12 jupiter kernel: drbd16: Began resync as SyncTarget (will sync 
557304 KB [139326 bits set]).
May  2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May  2 13:05:13 jupiter kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:13 jupiter kernel: drbd0: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:13 jupiter kernel: drbd0: Began resync as SyncTarget (will sync 
1019936 KB [254984 bits set]).
May  2 13:05:13 jupiter kernel: drbd0: Writing meta data super block now.
May  2 13:05:13 jupiter kernel: drbd14: conn( WFBitMapT -> WFSyncUUID )
May  2 13:05:13 jupiter kernel: drbd14: conn( WFSyncUUID -> SyncTarget )
May  2 13:05:13 jupiter kernel: drbd14: Began resync as SyncTarget (will sync 
717036 KB [179259 bits set]).
May  2 13:05:13 jupiter kernel: drbd14: Writing meta data super block now.
May  2 13:05:32 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May  2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May  2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275
May  2 13:05:34 jupiter kernel: [<c01506b2>] read_swap_cache_async+0x2f/0xac
May  2 13:05:34 jupiter kernel: [<c014627e>] swapin_readahead+0x3a/0x58
May  2 13:05:34 jupiter kernel: [<c0148720>] __handle_mm_fault+0xa62/0xfa3
May  2 13:05:34 jupiter kernel: [<c012ce08>] run_posix_cpu_timers+0x1c/0x6bf
May  2 13:05:34 jupiter kernel: [<c028a6b8>] _spin_lock_irq+0x8/0x18
May  2 13:05:34 jupiter kernel: [<c011194d>] do_page_fault+0x6af/0xb76
May  2 13:05:34 jupiter kernel: [<c012759a>] sys_times+0x185/0x1cb
May  2 13:05:34 jupiter kernel: [<c011129e>] do_page_fault+0x0/0xb76
May  2 13:05:34 jupiter kernel: [<c0104a0f>] error_code+0x2b/0x30
May  2 13:05:34 jupiter kernel: Mem-info:
May  2 13:05:34 jupiter kernel: DMA per-cpu:
May  2 13:05:34 jupiter kernel: cpu 0 hot: high 90, batch 15 used:13
May  2 13:05:34 jupiter kernel: cpu 0 cold: high 30, batch 7 used:0
May  2 13:05:34 jupiter kernel: cpu 1 hot: high 90, batch 15 used:14
May  2 13:05:34 jupiter kernel: cpu 1 cold: high 30, batch 7 used:28
May  2 13:05:34 jupiter kernel: cpu 2 hot: high 90, batch 15 used:5
May  2 13:05:34 jupiter kernel: cpu 2 cold: high 30, batch 7 used:26
May  2 13:05:31 jupiter logd: [7785]: WARN: G_CH_check_int: working on IPC 
channel took 4530 ms (> 100 ms)
May  2 13:05:34 jupiter logd: [7805]: WARN: G_CH_check_int: working on IPC 
channel took 4570 ms (> 100 ms)
May  2 13:05:34 jupiter kernel: cpu 3 hot: high 90, batch 15 used:7
May  2 13:05:34 jupiter kernel: cpu 3 cold: high 30, batch 7 used:1
May  2 13:05:34 jupiter kernel: DMA32 per-cpu: empty
May  2 13:05:34 jupiter kernel: Normal per-cpu: empty
May  2 13:05:34 jupiter kernel: HighMem per-cpu: empty
May  2 13:05:34 jupiter kernel: Free pages:        1480kB (0kB HighMem)
May  2 13:05:34 jupiter kernel: Active:107 inactive:10075 dirty:0 
writeback:9985 unstable:0 free:370 slab:6378 mapped:7 pagetables:207
May  2 13:05:34 jupiter kernel: DMA free:1480kB min:2052kB low:2564kB 
high:3076kB active:428kB inactive:40300kB present:264192kB 
pages_scanned:20832 all_unreclaimable? no
May  2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May  2 13:05:34 jupiter kernel: DMA32 free:0kB min:0kB low:0kB high:0kB 
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May  2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May  2 13:05:34 jupiter kernel: Normal free:0kB min:0kB low:0kB high:0kB 
active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May  2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May  2 13:05:34 jupiter kernel: HighMem free:0kB min:128kB low:128kB 
high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 
all_unreclaimable? no
May  2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May  2 13:05:34 jupiter kernel: DMA: 0*4kB 1*8kB 0*16kB 0*32kB 9*64kB 1*128kB 
1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1480kB
May  2 13:05:34 jupiter kernel: DMA32: empty
May  2 13:05:34 jupiter kernel: Normal: empty
May  2 13:05:34 jupiter kernel: HighMem: empty
May  2 13:05:34 jupiter kernel: Swap cache: add 17598, delete 7604, find 
216/258, race 0+0
May  2 13:05:34 jupiter kernel: Free swap  = 2862436kB
May  2 13:05:34 jupiter kernel: Total swap = 2931704kB
May  2 13:05:34 jupiter kernel: Free swap:       2862436kB
May  2 13:05:34 jupiter kernel: 66048 pages of RAM
May  2 13:05:34 jupiter kernel: 0 pages of HIGHMEM
May  2 13:05:34 jupiter kernel: 17980 reserved pages
May  2 13:05:34 jupiter kernel: 354 pages shared
May  2 13:05:34 jupiter kernel: 9996 pages swap cached
May  2 13:05:34 jupiter kernel: 0 pages dirty
May  2 13:05:34 jupiter kernel: 9985 pages writeback
May  2 13:05:34 jupiter kernel: 7 pages mapped
May  2 13:05:34 jupiter kernel: 6378 pages slab
May  2 13:05:34 jupiter kernel: 207 pages pagetables
May  2 13:05:34 jupiter kernel: Out of Memory: Kill process 3509 (slapd) score 
23575 and children.
May  2 13:05:34 jupiter kernel: Out of memory: Killed process 3509 (slapd).
May  2 13:05:34 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May  2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May  2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275

and then it basically goes boom....
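
To put numbers on the log above, the "Began resync" lines can be tallied. 
This is only a rough sketch: the figures are copied by hand from the syslog 
excerpt, and the names RESYNCS, total_kb etc. are made up for illustration.

```python
# Resync sizes copied by hand from the "Began resync" lines in the syslog
# excerpt: (device, role, KB to sync, bits set in the bitmap).
RESYNCS = [
    ("drbd8",  "SyncSource",  241612,  60403),
    ("drbd10", "SyncTarget",  736764, 184191),
    ("drbd24", "SyncTarget",  285660,  71415),
    ("drbd2",  "SyncTarget",  307228,  76807),
    ("drbd18", "SyncSource",  159696,  39924),
    ("drbd12", "SyncTarget",  558272, 139568),
    ("drbd20", "SyncTarget",  114688,  28672),
    ("drbd4",  "SyncSource",  143312,  35828),
    ("drbd22", "SyncTarget",  148532,  37133),
    ("drbd6",  "SyncTarget", 1058128, 264532),
    ("drbd16", "SyncTarget",  557304, 139326),
    ("drbd0",  "SyncTarget", 1019936, 254984),
    ("drbd14", "SyncTarget",  717036, 179259),
]

# Each line reports KB == 4 * bits, i.e. one bitmap bit covers 4 KB.
for dev, role, kb, bits in RESYNCS:
    assert kb == 4 * bits, dev

total_kb = sum(kb for _, _, kb, _ in RESYNCS)
target_kb = sum(kb for _, role, kb, _ in RESYNCS if role == "SyncTarget")
source_kb = total_kb - target_kb

print(f"{len(RESYNCS)} devices started resyncing within about one second")
print(f"queued in total:  {total_kb} KB ({total_kb / 1024**2:.2f} GB)")
print(f"  as SyncTarget:  {target_kb} KB")
print(f"  as SyncSource:  {source_kb} KB")
```

So all 13 resyncs start essentially at once, most of them with this node as 
the sync target, about 20 seconds before the first oom-killer message.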

All the drbd volumes together add up to less than 50 GB, so according to the 
hints that I found on the web (32 kB of RAM per GB of storage) DRBD should 
only consume a few MB.
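
The arithmetic behind that expectation, as a minimal sketch. It assumes one 
bitmap bit per 4 KB of storage (which is what the KB/bits ratio in the log 
suggests) and a 4 kB page size; both are assumptions on my part, and the 
function and constant names are hypothetical.

```python
BLOCK_KB = 4  # assumed: one bitmap bit tracks 4 KB of storage
PAGE_KB = 4   # assumed page size on this machine


def bitmap_kb(volume_gb):
    """Estimated bitmap size in kB for a volume of volume_gb GB."""
    bits = volume_gb * 1024 * 1024 // BLOCK_KB
    return bits / 8 / 1024


per_gb_kb = bitmap_kb(1)       # the "32 kB per GB" rule of thumb
all_volumes_kb = bitmap_kb(50)  # upper bound for < 50 GB of drbd volumes

# Figures from the Mem-info dump in the log.
ram_kb = 66048 * PAGE_KB        # "66048 pages of RAM"
reserved_kb = 17980 * PAGE_KB   # "17980 reserved pages"

print(f"bitmap per GB:       {per_gb_kb:.0f} kB")
print(f"bitmap for 50 GB:    {all_volumes_kb:.0f} kB")
print(f"RAM seen by kernel:  {ram_kb} kB, of which {reserved_kb} kB reserved")
print(f"bitmap share of RAM: {100 * all_volumes_kb / ram_kb:.2f} %")
```

By this estimate the bitmaps themselves are well under 1 % of the RAM the 
kernel reports, so they alone would not explain the OOM.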

What am I doing wrong here? Isn't this supposed to work? How much RAM does 
DRBD need?

Thanks for any hints,
Rene

-- 
-------------------------------------------------
Gibraltar firewall       http://www.gibraltar.at/