[DRBD-user] drbd sync killing machine with OOM
Rene Mayrhofer
rene.mayrhofer at gibraltar.at
Thu May 3 20:52:00 CEST 2007
Hi all,
[Resent because my mail from yesterday seemingly has not yet been approved by
the moderators.]
I am new to drbd8 (on Debian etch) and am using it to create HA XEN domU
instances across two servers. For various reasons (mostly to avoid
split-brain at any time), I use roughly the following layout:
/dev/mdX --> LVM2 volumes --> drbd volume for each LV --> XEN domU
So only one of the nodes switches a drbd'ed LV into primary and runs the domU
on it (and yes, I plan on writing a big HOWTO document as soon as it has run
for a few months and we have enough experience with live migration of XEN
instances etc.).
The critical problem that has taken our servers offline a few times
in the past 2 days is that when too many of the drbd volumes are syncing at
the same time, both nodes fall into OOM hell and reboot after some time,
repeating the cycle all over again.
Syslog on one of these occasions (after a /etc/init.d/drbd reload call on the
secondary node where the drbd volumes were manually set to disconnected
beforehand):
May 2 13:05:12 jupiter kernel: drbd0: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd0: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd0: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd0: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd0: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd0: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd0: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd0: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd14: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd14: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd14: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd14: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd14: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd14: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd14: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd14: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd16: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd16: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd16: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd16: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd16: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd16: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd18: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd18: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd18: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd16: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd18: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd18: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd18: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd18: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd6: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd6: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd6: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd6: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd6: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd6: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd6: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd8: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd8: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd8: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd8: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd8: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd8: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd20: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd20: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd8: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd20: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd20: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd2: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd2: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd2: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd20: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd2: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd2: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd2: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd2: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd4: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd4: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd4: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd4: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd4: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd4: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd22: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd4: peer( Unknown -> Secondary ) conn(
WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd22: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd22: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd22: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd22: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd22: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd22: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd24: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd24: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd24: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd24: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd24: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd24: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd10: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd10: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd10: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd10: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd10: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd10: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd10: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd12: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd12: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd12: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd12: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd12: Handshake successful: DRBD Network
Protocol version 86
May 2 13:05:12 jupiter kernel: drbd12: Peer authenticated using 32 bytes
of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd12: peer( Unknown -> Primary ) conn(
WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd8: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd8: Began resync as SyncSource (will sync
241612 KB [60403 bits set]).
May 2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd10: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd10: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd10: Began resync as SyncTarget (will sync
736764 KB [184191 bits set]).
May 2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd12: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd24: Began resync as SyncTarget (will sync
285660 KB [71415 bits set]).
May 2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd2: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd2: Began resync as SyncTarget (will sync
307228 KB [76807 bits set]).
May 2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd18: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd18: Began resync as SyncSource (will sync
159696 KB [39924 bits set]).
May 2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd12: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd12: Began resync as SyncTarget (will sync
558272 KB [139568 bits set]).
May 2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd20: Began resync as SyncTarget (will sync
114688 KB [28672 bits set]).
May 2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd4: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd4: Began resync as SyncSource (will sync
143312 KB [35828 bits set]).
May 2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd22: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd22: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd22: Began resync as SyncTarget (will sync
148532 KB [37133 bits set]).
May 2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd6: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd6: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd6: Began resync as SyncTarget (will sync
1058128 KB [264532 bits set]).
May 2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd16: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd16: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd16: Began resync as SyncTarget (will sync
557304 KB [139326 bits set]).
May 2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May 2 13:05:13 jupiter kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:13 jupiter kernel: drbd0: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:13 jupiter kernel: drbd0: Began resync as SyncTarget (will sync
1019936 KB [254984 bits set]).
May 2 13:05:13 jupiter kernel: drbd0: Writing meta data super block now.
May 2 13:05:13 jupiter kernel: drbd14: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:13 jupiter kernel: drbd14: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:13 jupiter kernel: drbd14: Began resync as SyncTarget (will sync
717036 KB [179259 bits set]).
May 2 13:05:13 jupiter kernel: drbd14: Writing meta data super block now.
May 2 13:05:32 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May 2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May 2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275
May 2 13:05:34 jupiter kernel: [<c01506b2>] read_swap_cache_async+0x2f/0xac
May 2 13:05:34 jupiter kernel: [<c014627e>] swapin_readahead+0x3a/0x58
May 2 13:05:34 jupiter kernel: [<c0148720>] __handle_mm_fault+0xa62/0xfa3
May 2 13:05:34 jupiter kernel: [<c012ce08>] run_posix_cpu_timers+0x1c/0x6bf
May 2 13:05:34 jupiter kernel: [<c028a6b8>] _spin_lock_irq+0x8/0x18
May 2 13:05:34 jupiter kernel: [<c011194d>] do_page_fault+0x6af/0xb76
May 2 13:05:34 jupiter kernel: [<c012759a>] sys_times+0x185/0x1cb
May 2 13:05:34 jupiter kernel: [<c011129e>] do_page_fault+0x0/0xb76
May 2 13:05:34 jupiter kernel: [<c0104a0f>] error_code+0x2b/0x30
May 2 13:05:34 jupiter kernel: Mem-info:
May 2 13:05:34 jupiter kernel: DMA per-cpu:
May 2 13:05:34 jupiter kernel: cpu 0 hot: high 90, batch 15 used:13
May 2 13:05:34 jupiter kernel: cpu 0 cold: high 30, batch 7 used:0
May 2 13:05:34 jupiter kernel: cpu 1 hot: high 90, batch 15 used:14
May 2 13:05:34 jupiter kernel: cpu 1 cold: high 30, batch 7 used:28
May 2 13:05:34 jupiter kernel: cpu 2 hot: high 90, batch 15 used:5
May 2 13:05:34 jupiter kernel: cpu 2 cold: high 30, batch 7 used:26
May 2 13:05:31 jupiter logd: [7785]: WARN: G_CH_check_int: working on IPC
channel took 4530 ms (> 100 ms)
May 2 13:05:34 jupiter logd: [7805]: WARN: G_CH_check_int: working on IPC
channel took 4570 ms (> 100 ms)
May 2 13:05:34 jupiter kernel: cpu 3 hot: high 90, batch 15 used:7
May 2 13:05:34 jupiter kernel: cpu 3 cold: high 30, batch 7 used:1
May 2 13:05:34 jupiter kernel: DMA32 per-cpu: empty
May 2 13:05:34 jupiter kernel: Normal per-cpu: empty
May 2 13:05:34 jupiter kernel: HighMem per-cpu: empty
May 2 13:05:34 jupiter kernel: Free pages: 1480kB (0kB HighMem)
May 2 13:05:34 jupiter kernel: Active:107 inactive:10075 dirty:0
writeback:9985 unstable:0 free:370 slab:6378 mapped:7 pagetables:207
May 2 13:05:34 jupiter kernel: DMA free:1480kB min:2052kB low:2564kB
high:3076kB active:428kB inactive:40300kB present:264192kB
pages_scanned:20832 all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: DMA32 free:0kB min:0kB low:0kB high:0kB
active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: Normal free:0kB min:0kB low:0kB high:0kB
active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: HighMem free:0kB min:128kB low:128kB
high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0
all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: DMA: 0*4kB 1*8kB 0*16kB 0*32kB 9*64kB 1*128kB
1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1480kB
May 2 13:05:34 jupiter kernel: DMA32: empty
May 2 13:05:34 jupiter kernel: Normal: empty
May 2 13:05:34 jupiter kernel: HighMem: empty
May 2 13:05:34 jupiter kernel: Swap cache: add 17598, delete 7604, find
216/258, race 0+0
May 2 13:05:34 jupiter kernel: Free swap = 2862436kB
May 2 13:05:34 jupiter kernel: Total swap = 2931704kB
May 2 13:05:34 jupiter kernel: Free swap: 2862436kB
May 2 13:05:34 jupiter kernel: 66048 pages of RAM
May 2 13:05:34 jupiter kernel: 0 pages of HIGHMEM
May 2 13:05:34 jupiter kernel: 17980 reserved pages
May 2 13:05:34 jupiter kernel: 354 pages shared
May 2 13:05:34 jupiter kernel: 9996 pages swap cached
May 2 13:05:34 jupiter kernel: 0 pages dirty
May 2 13:05:34 jupiter kernel: 9985 pages writeback
May 2 13:05:34 jupiter kernel: 7 pages mapped
May 2 13:05:34 jupiter kernel: 6378 pages slab
May 2 13:05:34 jupiter kernel: 207 pages pagetables
May 2 13:05:34 jupiter kernel: Out of Memory: Kill process 3509 (slapd) score
23575 and children.
May 2 13:05:34 jupiter kernel: Out of memory: Killed process 3509 (slapd).
May 2 13:05:34 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May 2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May 2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275
and then it basically goes boom....
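To get a feel for how much data the log above schedules for resync at the
same moment, the "Began resync" lines can be totalled with a short script.
This is only a sketch in Python: the function name summarize_resyncs and the
SAMPLE excerpt are illustrative, and the pattern assumes the message wording
shown in this syslog (including the mail line wrap after "will sync").

```python
import re

# Matches the "Began resync" kernel messages from the excerpt, tolerating the
# line wrap that the mail inserted between "will sync" and the size.
RESYNC_RE = re.compile(
    r"(drbd\d+): Began resync as (SyncSource|SyncTarget) "
    r"\(will sync\s+(\d+) KB \[(\d+) bits set\]\)"
)


def summarize_resyncs(log_text):
    """Return (device_count, total_kb, per_role_kb) for resyncs in log_text."""
    per_role = {"SyncSource": 0, "SyncTarget": 0}
    devices = set()
    for dev, role, kb, _bits in RESYNC_RE.findall(log_text):
        devices.add(dev)
        per_role[role] += int(kb)
    return len(devices), sum(per_role.values()), per_role


# Two entries copied from the syslog excerpt (SAMPLE is a hypothetical name).
SAMPLE = """\
May 2 13:05:12 jupiter kernel: drbd8: Began resync as SyncSource (will sync
241612 KB [60403 bits set]).
May 2 13:05:12 jupiter kernel: drbd10: Began resync as SyncTarget (will sync
736764 KB [184191 bits set]).
"""

if __name__ == "__main__":
    count, total_kb, per_role = summarize_resyncs(SAMPLE)
    print(f"{count} devices, {total_kb} KB to resync, by role: {per_role}")
```

Adding up the thirteen "will sync" figures in the excerpt by hand gives
6048168 KB, roughly 5.8 GB, all started within about one second.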
All the drbd volumes together total less than 50GB, so according to the hints
that I found on the web (32kB RAM usage per GB) drbd should only consume a few
MB.
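As a rough check, that rule of thumb can be put next to the Mem-info figures
from the OOM report. The sketch below assumes 4 kB pages (which is consistent
with "free:370" pages versus "DMA free:1480kB" in the log) and takes the
32 kB per GB figure at face value; the helper names are made up for
illustration.

```python
PAGE_KB = 4  # assumed page size; matches free:370 pages == 1480kB in the log


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count from the Mem-info dump to kB."""
    return pages * page_kb


def bitmap_estimate_kb(volume_gb, kb_per_gb=32):
    """RAM estimate from the rule of thumb quoted in the mail (32 kB per GB)."""
    return volume_gb * kb_per_gb


if __name__ == "__main__":
    total_ram_kb = pages_to_kb(66048)   # "66048 pages of RAM"
    reserved_kb = pages_to_kb(17980)    # "17980 reserved pages"
    writeback_kb = pages_to_kb(9985)    # "9985 pages writeback"
    free_kb = pages_to_kb(370)          # "free:370"
    bitmap_kb = bitmap_estimate_kb(50)  # upper bound: "less than 50GB"

    print(f"total RAM:        {total_ram_kb} kB")
    print(f"reserved:         {reserved_kb} kB")
    print(f"under writeback:  {writeback_kb} kB")
    print(f"free:             {free_kb} kB")
    print(f"bitmap estimate:  {bitmap_kb} kB")
```

Under these assumptions the expected bitmap cost is about 1.6 MB, while the
machine has roughly 258 MB of RAM in total and about 39 MB of pages in
writeback at the time of the OOM, so the bitmap estimate alone does not
account for the shortage.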
What am I doing wrong here? Isn't this supposed to work? How much RAM does
DRBD need?
Thanks for any hints,
Rene
--
-------------------------------------------------
Gibraltar firewall http://www.gibraltar.at/