On Tue, Jun 19, 2007 at 11:08:04AM +0200, H.D. wrote:
> After an `drbdadm invalidate all' on the secondary, I got that line in
> the logs of the primary. Short after that the secondary machine crashed.
> It was at 3-4% of the resync.
>
> I don't know `how' it crashed, it just showed a black screen and was
> completely hung.
>
> Thanks for a reply.

which drbd version is this?

> drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Inconsistent )
> drbd0: Writing meta data super block now.
> drbd0: writing of bitmap took 20 jiffies
> drbd0: 300 GB marked out-of-sync by on disk bit-map.
> drbd0: 314572800 KB now marked out-of-sync by on disk bit-map.
> drbd0: Writing meta data super block now.
> drbd0: conn( StartingSyncS -> SyncSource )
> drbd0: Began resync as SyncSource (will sync 314572800 KB [78643200 bits
> set]).
> drbd0: Writing meta data super block now.
> drbd0: PingAck did not arrive in time.
> drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
> drbd0: asender terminated
> drbd0: drbd_pp_alloc interrupted!
> drbd0: alloc_ee: Allocation of a page failed

interesting.
apparently some hard out-of-memory situation...
we usually handle those as gracefully as possible,
but there may still be bugs lurking.
it may also have triggered some other resource starvation deadlock.

> drbd0: error receiving RSDataRequest, l: 24!
> drbd0: drbd_send_block() failed
> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().

this is not a "BUG" in the sense of kernel BUG(),
but a hint for us to investigate a _possible_ "logic bug":
the worker implicitly updates the on-disk meta data
where we should have done so explicitly.

it may still be a hint about a dead thread,
but since there is nothing else showing up here, this seems unlikely.

> drbd0: Writing meta data super block now.
> drbd0: tl_clear()
> drbd0: Connection closed
> drbd0: conn( NetworkFailure -> Unconnected )
> drbd0: receiver terminated
> drbd0: receiver (re)started
> drbd0: conn( Unconnected -> WFConnection )
> e1000: repl2: e1000_watchdog: NIC Link is Down
> e1000: repl1: e1000_watchdog: NIC Link is Down

your NIC seems very unhappy about all that traffic suddenly going on.
so maybe it is a hardware problem after all, or a misbehaving NIC driver?
maybe even bad RAM?

> bonding: bond0: link status definitely down for interface repl1,
> disabling it
> bonding: bond0: link status definitely down for interface repl2,
> disabling it
> bonding: bond0: now running without any active interface !

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.
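
A side note on the figures in the quoted log: the three sizes it reports (300 GB, 314572800 KB, 78643200 bits set) appear to describe the same amount of data. The sketch below only redoes that arithmetic. It assumes binary units (1 GB = 1024 * 1024 KB) and infers the amount of storage covered by one bitmap bit from the log's own numbers. The function names are made up for this illustration and are not part of DRBD.

```python
# Figures copied from the quoted log lines.
OUT_OF_SYNC_KB = 314572800  # "314572800 KB now marked out-of-sync"
BITS_SET = 78643200         # "[78643200 bits set]"

# Assumption: the log's "GB" and "KB" are binary units.
KB_PER_GB = 1024 * 1024


def kb_per_bit(total_kb: int, bits: int) -> float:
    """Storage covered by one bitmap bit, inferred from the log totals."""
    return total_kb / bits


def kb_to_gb(total_kb: int) -> float:
    """Convert KB to GB using binary units."""
    return total_kb / KB_PER_GB


if __name__ == "__main__":
    print("KB per bitmap bit:", kb_per_bit(OUT_OF_SYNC_KB, BITS_SET))
    print("GB out of sync:", kb_to_gb(OUT_OF_SYNC_KB))
```

Under those assumptions each set bit stands for 4 KB, and the total matches the 300 GB the log reports, so the whole device was marked out-of-sync by the invalidate.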