Hi all,

[Resent because my mail from yesterday has apparently not yet been approved by the moderators.]

I am a new user of drbd8 (on Debian etch), which I use to create HA XEN domU instances across two servers. For various reasons (mostly to avoid split-brain at any time), I use roughly the following layout:

/dev/mdX --> LVM2 volumes --> drbd volume for each LV --> XEN domU

So only one of the nodes switches a drbd'ed LV into primary and runs the domU on it (and yes, I plan on writing a big HOWTO document as soon as it has run for a few months and we have enough experience with live migration of XEN instances etc.).

The critical problem, which has taken our servers offline a few times in the past 2 days, is that when too many of the drbd volumes are syncing at the same time, both nodes fall into OOM hell and reboot after some time, repeating the cycle all over again.

Syslog on one of these occasions (after a /etc/init.d/drbd reload call on the secondary node, where the drbd volumes had been manually set to disconnected beforehand):

May 2 13:05:12 jupiter kernel: drbd0: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd0: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd0: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd0: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd0: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd0: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd0: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd14: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd14: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd14: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd14: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd14: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd14: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd14: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd14: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd16: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd16: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd16: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd16: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd16: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd16: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd18: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd18: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd18: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd16: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd18: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd18: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd18: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd18: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd6: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd6: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd6: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd6: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd6: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd6: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd6: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd8: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd8: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd8: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd8: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd8: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd8: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd20: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd20: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd8: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd20: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd20: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd2: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd2: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd2: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd20: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd2: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd2: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd2: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd2: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd4: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd4: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd4: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd4: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd4: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd4: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd22: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd4: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent )
May 2 13:05:12 jupiter kernel: drbd22: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd22: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd22: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd22: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd22: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd22: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd24: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd24: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd24: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd24: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd24: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd24: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd10: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd10: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd10: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd10: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd10: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd10: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd10: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd12: conn( StandAlone -> Unconnected )
May 2 13:05:12 jupiter kernel: drbd12: receiver (re)started
May 2 13:05:12 jupiter kernel: drbd12: conn( Unconnected -> WFConnection )
May 2 13:05:12 jupiter kernel: drbd12: conn( WFConnection -> WFReportParams )
May 2 13:05:12 jupiter kernel: drbd12: Handshake successful: DRBD Network Protocol version 86
May 2 13:05:12 jupiter kernel: drbd12: Peer authenticated using 32 bytes of 'sha256' HMAC
May 2 13:05:12 jupiter kernel: drbd12: peer( Unknown -> Primary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate )
May 2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd8: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd8: Began resync as SyncSource (will sync 241612 KB [60403 bits set]).
May 2 13:05:12 jupiter kernel: drbd8: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd10: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd10: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd10: Began resync as SyncTarget (will sync 736764 KB [184191 bits set]).
May 2 13:05:12 jupiter kernel: drbd10: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd12: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd24: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd24: Began resync as SyncTarget (will sync 285660 KB [71415 bits set]).
May 2 13:05:12 jupiter kernel: drbd24: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd2: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd2: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd2: Began resync as SyncTarget (will sync 307228 KB [76807 bits set]).
May 2 13:05:12 jupiter kernel: drbd2: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd18: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd18: Began resync as SyncSource (will sync 159696 KB [39924 bits set]).
May 2 13:05:12 jupiter kernel: drbd18: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd12: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd12: Began resync as SyncTarget (will sync 558272 KB [139568 bits set]).
May 2 13:05:12 jupiter kernel: drbd12: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd20: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd20: Began resync as SyncTarget (will sync 114688 KB [28672 bits set]).
May 2 13:05:12 jupiter kernel: drbd20: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd4: conn( WFBitMapS -> SyncSource )
May 2 13:05:12 jupiter kernel: drbd4: Began resync as SyncSource (will sync 143312 KB [35828 bits set]).
May 2 13:05:12 jupiter kernel: drbd4: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd22: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd22: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd22: Began resync as SyncTarget (will sync 148532 KB [37133 bits set]).
May 2 13:05:12 jupiter kernel: drbd22: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd6: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd6: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd6: Began resync as SyncTarget (will sync 1058128 KB [264532 bits set]).
May 2 13:05:12 jupiter kernel: drbd6: Writing meta data super block now.
May 2 13:05:12 jupiter kernel: drbd16: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:12 jupiter kernel: drbd16: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:12 jupiter kernel: drbd16: Began resync as SyncTarget (will sync 557304 KB [139326 bits set]).
May 2 13:05:12 jupiter kernel: drbd16: Writing meta data super block now.
May 2 13:05:13 jupiter kernel: drbd0: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:13 jupiter kernel: drbd0: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:13 jupiter kernel: drbd0: Began resync as SyncTarget (will sync 1019936 KB [254984 bits set]).
May 2 13:05:13 jupiter kernel: drbd0: Writing meta data super block now.
May 2 13:05:13 jupiter kernel: drbd14: conn( WFBitMapT -> WFSyncUUID )
May 2 13:05:13 jupiter kernel: drbd14: conn( WFSyncUUID -> SyncTarget )
May 2 13:05:13 jupiter kernel: drbd14: Began resync as SyncTarget (will sync 717036 KB [179259 bits set]).
May 2 13:05:13 jupiter kernel: drbd14: Writing meta data super block now.
May 2 13:05:32 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May 2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May 2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275
May 2 13:05:34 jupiter kernel: [<c01506b2>] read_swap_cache_async+0x2f/0xac
May 2 13:05:34 jupiter kernel: [<c014627e>] swapin_readahead+0x3a/0x58
May 2 13:05:34 jupiter kernel: [<c0148720>] __handle_mm_fault+0xa62/0xfa3
May 2 13:05:34 jupiter kernel: [<c012ce08>] run_posix_cpu_timers+0x1c/0x6bf
May 2 13:05:34 jupiter kernel: [<c028a6b8>] _spin_lock_irq+0x8/0x18
May 2 13:05:34 jupiter kernel: [<c011194d>] do_page_fault+0x6af/0xb76
May 2 13:05:34 jupiter kernel: [<c012759a>] sys_times+0x185/0x1cb
May 2 13:05:34 jupiter kernel: [<c011129e>] do_page_fault+0x0/0xb76
May 2 13:05:34 jupiter kernel: [<c0104a0f>] error_code+0x2b/0x30
May 2 13:05:34 jupiter kernel: Mem-info:
May 2 13:05:34 jupiter kernel: DMA per-cpu:
May 2 13:05:34 jupiter kernel: cpu 0 hot: high 90, batch 15 used:13
May 2 13:05:34 jupiter kernel: cpu 0 cold: high 30, batch 7 used:0
May 2 13:05:34 jupiter kernel: cpu 1 hot: high 90, batch 15 used:14
May 2 13:05:34 jupiter kernel: cpu 1 cold: high 30, batch 7 used:28
May 2 13:05:34 jupiter kernel: cpu 2 hot: high 90, batch 15 used:5
May 2 13:05:34 jupiter kernel: cpu 2 cold: high 30, batch 7 used:26
May 2 13:05:31 jupiter logd: [7785]: WARN: G_CH_check_int: working on IPC channel took 4530 ms (> 100 ms)
May 2 13:05:34 jupiter logd: [7805]: WARN: G_CH_check_int: working on IPC channel took 4570 ms (> 100 ms)
May 2 13:05:34 jupiter kernel: cpu 3 hot: high 90, batch 15 used:7
May 2 13:05:34 jupiter kernel: cpu 3 cold: high 30, batch 7 used:1
May 2 13:05:34 jupiter kernel: DMA32 per-cpu: empty
May 2 13:05:34 jupiter kernel: Normal per-cpu: empty
May 2 13:05:34 jupiter kernel: HighMem per-cpu: empty
May 2 13:05:34 jupiter kernel: Free pages: 1480kB (0kB HighMem)
May 2 13:05:34 jupiter kernel: Active:107 inactive:10075 dirty:0 writeback:9985 unstable:0 free:370 slab:6378 mapped:7 pagetables:207
May 2 13:05:34 jupiter kernel: DMA free:1480kB min:2052kB low:2564kB high:3076kB active:428kB inactive:40300kB present:264192kB pages_scanned:20832 all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: DMA32 free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: Normal free:0kB min:0kB low:0kB high:0kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: HighMem free:0kB min:128kB low:128kB high:128kB active:0kB inactive:0kB present:0kB pages_scanned:0 all_unreclaimable? no
May 2 13:05:34 jupiter kernel: lowmem_reserve[]: 0 0 0 0
May 2 13:05:34 jupiter kernel: DMA: 0*4kB 1*8kB 0*16kB 0*32kB 9*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1480kB
May 2 13:05:34 jupiter kernel: DMA32: empty
May 2 13:05:34 jupiter kernel: Normal: empty
May 2 13:05:34 jupiter kernel: HighMem: empty
May 2 13:05:34 jupiter kernel: Swap cache: add 17598, delete 7604, find 216/258, race 0+0
May 2 13:05:34 jupiter kernel: Free swap = 2862436kB
May 2 13:05:34 jupiter kernel: Total swap = 2931704kB
May 2 13:05:34 jupiter kernel: Free swap: 2862436kB
May 2 13:05:34 jupiter kernel: 66048 pages of RAM
May 2 13:05:34 jupiter kernel: 0 pages of HIGHMEM
May 2 13:05:34 jupiter kernel: 17980 reserved pages
May 2 13:05:34 jupiter kernel: 354 pages shared
May 2 13:05:34 jupiter kernel: 9996 pages swap cached
May 2 13:05:34 jupiter kernel: 0 pages dirty
May 2 13:05:34 jupiter kernel: 9985 pages writeback
May 2 13:05:34 jupiter kernel: 7 pages mapped
May 2 13:05:34 jupiter kernel: 6378 pages slab
May 2 13:05:34 jupiter kernel: 207 pages pagetables
May 2 13:05:34 jupiter kernel: Out of Memory: Kill process 3509 (slapd) score 23575 and children.
May 2 13:05:34 jupiter kernel: Out of memory: Killed process 3509 (slapd).
May 2 13:05:34 jupiter kernel: oom-killer: gfp_mask=0x200d2, order=0
May 2 13:05:34 jupiter kernel: [<c013f243>] out_of_memory+0x25/0x13a
May 2 13:05:34 jupiter kernel: [<c0140721>] __alloc_pages+0x1f5/0x275

and then it basically goes boom....

All the drbd volumes together total less than 50GB, so according to the hints that I found on the web (32kB RAM usage per GB), DRBD should only consume a few MB. What am I doing wrong here? Isn't this supposed to work? How much RAM does DRBD need?

Thanks for any hints,
Rene

--
-------------------------------------------------
Gibraltar firewall       http://www.gibraltar.at/
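
For reference, the arithmetic behind the "few MB" estimate can be set next to the figures the kernel printed. The following is only a rough sketch in Python: all constants are copied from the post and its syslog excerpt, the 4 kB page size is an assumption (it is consistent with "66048 pages of RAM" and "present:264192kB"), and the name summarize is made up for illustration. It does not identify the cause of the OOM; it only makes the sizes easier to compare.

```python
# Figures copied from the post and its syslog excerpt.
TOTAL_DRBD_GB = 50        # "less than 50GB" in total, used as an upper bound
BITMAP_KB_PER_GB = 32     # rule of thumb quoted in the post
PAGE_KB = 4               # assumption: 4 kB pages
RAM_PAGES = 66048         # "66048 pages of RAM"
WRITEBACK_PAGES = 9985    # "9985 pages writeback"

# KB announced by each "Began resync" line, keyed by device.
RESYNC_KB = {
    "drbd8": 241612, "drbd10": 736764, "drbd24": 285660, "drbd2": 307228,
    "drbd18": 159696, "drbd12": 558272, "drbd20": 114688, "drbd4": 143312,
    "drbd22": 148532, "drbd6": 1058128, "drbd16": 557304, "drbd0": 1019936,
    "drbd14": 717036,
}


def summarize():
    """Return the sizes discussed in the post, all in kB (plus a count)."""
    return {
        # Expected bitmap memory according to the 32 kB per GB rule of thumb.
        "bitmap_kb": TOTAL_DRBD_GB * BITMAP_KB_PER_GB,
        # RAM reported by the kernel in the OOM dump.
        "ram_kb": RAM_PAGES * PAGE_KB,
        # Memory tied up in pages under writeback at the time of the OOM.
        "writeback_kb": WRITEBACK_PAGES * PAGE_KB,
        # Number of resyncs that started, and their combined size.
        "resync_count": len(RESYNC_KB),
        "resync_total_kb": sum(RESYNC_KB.values()),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

By this estimate the bitmaps account for roughly 1.6 MB, while the dump shows about 258 MB of RAM in total with roughly 39 MB of it in pages under writeback, and 13 resyncs started within about one second.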