8.0.3 on
Linux blackbird 2.6.21-gentoo #3 SMP Wed May 23 11:53:38 CEST 2007 x86_64 Intel(R) Xeon(R) CPU 5130 @ 2.00GHz GenuineIntel GNU/Linux

I now have the log of the crashed secondary machine from shortly before the reboot:

Jun 19 10:47:50 phoenix drbd0: conn( Connected -> StartingSyncT ) disk( UpToDate -> Inconsistent )
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:47:50 phoenix drbd0: writing of bitmap took 19 jiffies
Jun 19 10:47:50 phoenix drbd0: 300 GB marked out-of-sync by on disk bit-map.
Jun 19 10:47:50 phoenix drbd0: 314572800 KB now marked out-of-sync by on disk bit-map.
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:47:50 phoenix drbd0: conn( StartingSyncT -> WFSyncUUID )
Jun 19 10:47:50 phoenix drbd0: conn( WFSyncUUID -> SyncTarget )
Jun 19 10:47:50 phoenix drbd0: Began resync as SyncTarget (will sync 314572800 KB [78643200 bits set]).
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:58:16 phoenix Linux version 2.6.21-gentoo (root@blackbird) (gcc version 4.1.1 (Gentoo 4.1.1-r3)) #3 SMP Wed May 23 11:53:38 CEST 2007
Jun 19 10:58:16 phoenix Command line: root=/dev/sda1 rootflags="nobarrier,bsdgroups,prjquota,inode64"

There does not seem to be anything useful in it, though.

Thanks for your help.

On 19.06.2007 11:30, Lars Ellenberg wrote:
> On Tue, Jun 19, 2007 at 11:08:04AM +0200, H.D. wrote:
>> After a `drbdadm invalidate all' on the secondary, I got that line in
>> the logs of the primary. Shortly after that the secondary machine crashed.
>> It was at 3-4% of the resync.
>>
>> I don't know `how' it crashed, it just showed a black screen and was
>> completely hung.
>>
>> Thanks for a reply.
>
> which drbd version is this?
>
>> drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Inconsistent )
>> drbd0: Writing meta data super block now.
>> drbd0: writing of bitmap took 20 jiffies
>> drbd0: 300 GB marked out-of-sync by on disk bit-map.
>> drbd0: 314572800 KB now marked out-of-sync by on disk bit-map.
>> drbd0: Writing meta data super block now.
>> drbd0: conn( StartingSyncS -> SyncSource )
>> drbd0: Began resync as SyncSource (will sync 314572800 KB [78643200 bits
>> set]).
>> drbd0: Writing meta data super block now.
>> drbd0: PingAck did not arrive in time.
>> drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
>> drbd0: asender terminated
>> drbd0: drbd_pp_alloc interrupted!
>> drbd0: alloc_ee: Allocation of a page failed
>
> interesting. apparently some hard out-of-memory situation...
> we usually handle them as gracefully as possible,
> but there may still be bugs lurking.
> it may also have triggered some other resource starvation deadlock.
>
>> drbd0: error receiving RSDataRequest, l: 24!
>> drbd0: drbd_send_block() failed
>> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
>
> this is not a "BUG" in the sense of kernel BUG(),
> but a hint for us to investigate a _possible_ "logic bug",
> implicitly updates the on-disk meta data
> where we should have done so explicitly.
>
> it may be a hint about a dead thread, still,
> but since there is nothing else showing up here,
> this seems unlikely.
>
>> drbd0: Writing meta data super block now.
>> drbd0: tl_clear()
>> drbd0: Connection closed
>> drbd0: conn( NetworkFailure -> Unconnected )
>> drbd0: receiver terminated
>> drbd0: receiver (re)started
>> drbd0: conn( Unconnected -> WFConnection )
>> e1000: repl2: e1000_watchdog: NIC Link is Down
>> e1000: repl1: e1000_watchdog: NIC Link is Down
>
> your nic seems very unhappy about all that traffic suddenly going on.
> so maybe it is even hardware, after all,
> or misbehaving NIC driver?
> maybe even bad ram?
>
>> bonding: bond0: link status definitely down for interface repl1,
>> disabling it
>> bonding: bond0: link status definitely down for interface repl2,
>> disabling it
>> bonding: bond0: now running without any active interface !
>

--
Regards,
H.D.
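
PS: As a sanity check on the figures in the "Began resync" log lines, here is a minimal Python sketch. Only the two numbers come from the log; the variable names are mine. It just confirms that the KB count and the number of set bits agree with each other and with the "300 GB" line, i.e. that each bitmap bit stands for 4 KB of storage.

```python
# Figures copied from the log line:
#   "Began resync as SyncTarget (will sync 314572800 KB [78643200 bits set])."
out_of_sync_kb = 314572800
bits_set = 78643200

# How much storage one out-of-sync bitmap bit represents.
kb_per_bit = out_of_sync_kb / bits_set

# Total amount to resync, in units of 1024 * 1024 KB (what the log calls "GB").
size_gb = out_of_sync_kb / (1024 * 1024)

print(f"{kb_per_bit:g} KB per bitmap bit")   # prints "4 KB per bitmap bit"
print(f"{size_gb:g} GB marked out-of-sync")  # prints "300 GB marked out-of-sync"
```

So the whole 300 GB device was marked out-of-sync, which is what one would expect after `drbdadm invalidate all'.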