[DRBD-user] drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().

H.D. devnull at deleted.on.request
Tue Jun 19 11:40:45 CEST 2007



This is DRBD 8.0.3 on

Linux blackbird 2.6.21-gentoo #3 SMP Wed May 23 11:53:38 CEST 2007 
x86_64 Intel(R) Xeon(R) CPU            5130  @ 2.00GHz GenuineIntel 
GNU/Linux

I now have the log of the crashed secondary machine from shortly before the reboot:

Jun 19 10:47:50 phoenix drbd0: conn( Connected -> StartingSyncT ) disk( 
UpToDate -> Inconsistent )
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:47:50 phoenix drbd0: writing of bitmap took 19 jiffies
Jun 19 10:47:50 phoenix drbd0: 300 GB marked out-of-sync by on disk bit-map.
Jun 19 10:47:50 phoenix drbd0: 314572800 KB now marked out-of-sync by on 
disk bit-map.
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:47:50 phoenix drbd0: conn( StartingSyncT -> WFSyncUUID )
Jun 19 10:47:50 phoenix drbd0: conn( WFSyncUUID -> SyncTarget )
Jun 19 10:47:50 phoenix drbd0: Began resync as SyncTarget (will sync 
314572800 KB [78643200 bits set]).
Jun 19 10:47:50 phoenix drbd0: Writing meta data super block now.
Jun 19 10:58:16 phoenix Linux version 2.6.21-gentoo (root@blackbird) 
(gcc version 4.1.1 (Gentoo 4.1.1-r3)) #3 SMP Wed May 23 11:53:38 CEST 2007
Jun 19 10:58:16 phoenix Command line: root=/dev/sda1 
rootflags="nobarrier,bsdgroups,prjquota,inode64"

There does not seem to be anything useful in it, though.
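
As a quick sanity check on the numbers in the log above, here is a minimal Python sketch. The 4 KB-per-bit ratio is an assumption inferred from the two figures in the log (314572800 KB and 78643200 bits), and the function names are made up for illustration.

```python
# Figures copied from the log lines above.
OUT_OF_SYNC_KB = 314572800
BITS_SET = 78643200

# Assumption: one bitmap bit covers 4 KB of storage
# (inferred from the two log figures, not stated in the log).
KB_PER_BIT = 4


def kb_to_gb(kb: int) -> float:
    """Convert KB to GB using binary units (1 GB = 1024 * 1024 KB)."""
    return kb / (1024 * 1024)


def bits_for(kb: int, kb_per_bit: int = KB_PER_BIT) -> int:
    """Number of bitmap bits needed to cover `kb` kilobytes."""
    return kb // kb_per_bit


if __name__ == "__main__":
    print("out of sync:", kb_to_gb(OUT_OF_SYNC_KB), "GB")
    print("bit count matches log:", bits_for(OUT_OF_SYNC_KB) == BITS_SET)
```

So the "300 GB", "314572800 KB" and "78643200 bits set" lines all describe the same full-device resync, as expected after an invalidate.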

Thanks for your help.



On 19.06.2007 11:30, Lars Ellenberg wrote:
> On Tue, Jun 19, 2007 at 11:08:04AM +0200, H.D. wrote:
>> After a `drbdadm invalidate all' on the secondary, I got that line in 
>> the logs of the primary. Shortly after that the secondary machine crashed. 
>> It was at 3-4% of the resync.
>>
>> I don't know `how' it crashed, it just showed a black screen and was 
>> completely hung.
>>
>> Thanks for a reply.
> 
> which drbd version is this?
> 
>> drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Inconsistent )
>> drbd0: Writing meta data super block now.
>> drbd0: writing of bitmap took 20 jiffies
>> drbd0: 300 GB marked out-of-sync by on disk bit-map.
>> drbd0: 314572800 KB now marked out-of-sync by on disk bit-map.
>> drbd0: Writing meta data super block now.
>> drbd0: conn( StartingSyncS -> SyncSource )
>> drbd0: Began resync as SyncSource (will sync 314572800 KB [78643200 bits 
>> set]).
>> drbd0: Writing meta data super block now.
>> drbd0: PingAck did not arrive in time.
>> drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )
>> drbd0: asender terminated
>> drbd0: drbd_pp_alloc interrupted!
>> drbd0: alloc_ee: Allocation of a page failed
> 
> interesting. apparently some hard out-of-memory situation...
> we usually handle them as gracefully as possible,
> but there may still be bugs lurking.
> it may also have triggered some other resource starvation deadlock.
> 
>> drbd0: error receiving RSDataRequest, l: 24!
>> drbd0: drbd_send_block() failed
>> drbd0: BUG! md_sync_timer expired! Worker calls drbd_md_sync().
> 
> this is not a "BUG" in the sense of kernel BUG(),
> but a hint for us to investigate a _possible_ "logic bug":
> the worker implicitly updates the on-disk meta data
> where we should have done so explicitly.
> 
> it may be a hint about a dead thread, still,
> but since there is nothing else showing up here,
> this seems unlikely.
> 
>> drbd0: Writing meta data super block now.
>> drbd0: tl_clear()
>> drbd0: Connection closed
>> drbd0: conn( NetworkFailure -> Unconnected )
>> drbd0: receiver terminated
>> drbd0: receiver (re)started
>> drbd0: conn( Unconnected -> WFConnection )
>> e1000: repl2: e1000_watchdog: NIC Link is Down
>> e1000: repl1: e1000_watchdog: NIC Link is Down
> 
> your nic seems very unhappy about all that traffic suddenly going on.
> so maybe it is even hardware, after all,
> or misbehaving NIC driver?
> maybe even bad ram?
> 
>> bonding: bond0: link status definitely down for interface repl1, 
>> disabling it
>> bonding: bond0: link status definitely down for interface repl2, 
>> disabling it
>> bonding: bond0: now running without any active interface !
> 
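
To follow the state changes in logs like the ones quoted above, the `name( Old -> New )` fragments can be pulled out with a small script. This is only a rough sketch: the regular expression and the function name `transitions` are my own illustration, not anything shipped with DRBD, and it assumes each log entry has been joined back onto a single line (the mail client wrapped some of them).

```python
import re

# Matches fragments such as "conn( Connected -> StartingSyncS )".
# Assumption: state and field names consist of word characters only.
TRANSITION_RE = re.compile(r"(\w+)\(\s*(\w+)\s*->\s*(\w+)\s*\)")


def transitions(line: str) -> list[tuple[str, str, str]]:
    """Return (field, old_state, new_state) for each transition in a log line."""
    return TRANSITION_RE.findall(line)


if __name__ == "__main__":
    # Sample lines copied from the primary's log quoted above.
    sample = [
        "drbd0: conn( Connected -> StartingSyncS ) pdsk( UpToDate -> Inconsistent )",
        "drbd0: Writing meta data super block now.",
        "drbd0: peer( Secondary -> Unknown ) conn( SyncSource -> NetworkFailure )",
        "drbd0: conn( Unconnected -> WFConnection )",
    ]
    for line in sample:
        for field, old, new in transitions(line):
            print(f"{field}: {old} -> {new}")
```

Lines without a transition, such as "Writing meta data super block now.", simply produce no output.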

-- 
Regards,
H.D.


