[DRBD-user] strange drbd bug

Lars Ellenberg lars.ellenberg at linbit.com
Tue Oct 14 22:03:28 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Tue, Oct 14, 2014 at 08:48:05PM +0200, Matteo Tescione wrote:
> Lars,
> 
>  
> > exact DRBD version?
> > in tree, out of tree?
> > kernel? patches applied? which ones?
> > 
> > DRBD Configuration and IO stack?
> > What do you do to trigger it?
> > How easy can you reproduce?
> > Logs leading up to this?
> 
> 8.4.4 downloaded from official repository, compiled out of tree.
> Kernel is 3.12.14
> with only scst req_fifo patch

I don't know about that one

> & iscsi_putpage_callback patch applied.

But last time I've seen this one,
it did not properly work together with DRBD.
Possibly causing potential memory corruption, even? Not sure.
Has this been revisited?

> I do nothing, the 2 box are interconnected with 40-gbit back-to-back
> infiniband qle7342, but I seen several disconnection event repeating
> random even with no IO, and noticed that once in a month the primary
> box crashes. So I turned on the console logging and captured the trace
> you see below.
> 
> As I said, occurs only once in 30-40 days.
> 
> Logs before this panic:

Do you like double newlines,
or is this mangled on my end only?

> drbd TOP: Restarting receiver thread
> drbd TOP: receiver (re)started
> drbd TOP: conn( Unconnected -> WFConnection )
> drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400)
> drbd TOP: fence-peer helper returned 4 (peer was fenced)
> drbd TOP: pdsk( DUnknown -> Outdated )
> drbd TOP: Handshake successful: Agreed network protocol version 101
> drbd TOP: Agreed to support TRIM on protocol level
> drbd TOP: Peer authenticated using 20 bytes HMAC
> drbd TOP: conn( WFConnection -> WFReportParams )
> drbd TOP: Starting asender thread (from drbd_r_TOP [4385])
> block drbd0: drbd_sync_handshake:
> block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0
> block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0
> block drbd0: uuid_compare()=1 by rule 70
> block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
> block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
> block drbd1: drbd_sync_handshake:
> block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0
> block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0
> block drbd1: uuid_compare()=1 by rule 70
> block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
> block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
> block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
> block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]).
> block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995
> block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
> block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]).
> block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB
> block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec)
> block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D
> block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec)
> block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7
> block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> block drbd1: Remote failed to finish a request within ko-count * timeout
> block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> drbd TOP: asender terminated
> drbd TOP: Terminating drbd_a_TOP
> block drbd1: md_sync_timer expired! Worker calls drbd_md_sync().
> block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D
> drbd TOP: Connection closed
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP:  mask = 0x1f0 val = 0x20
> drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected
> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
> drbd TOP: receiver terminated
> drbd TOP: Restarting receiver thread
> drbd TOP: receiver (re)started
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0:  wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP:  mask = 0x1f0 val = 0x80
> drbd TOP:  old_conn:WFReportParams wanted_conn:WFConnection
> drbd TOP: Connection closed
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP:  mask = 0x1f0 val = 0x20
> drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected
> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
> drbd TOP: receiver terminated
> drbd TOP: Terminating drbd_r_TOP
> block drbd0: Remote failed to finish a request within ko-count * timeout
> block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )


Looks busy, but "normal" failure cases, so far.

> drbd TOP: Starting receiver thread (from swapper/0 [0])

That is strange, the receiver thread is supposed to be started from the
"drbd worker" thread?

> BUG: scheduling while atomic: swapper/0/0/0x00000100

Boom.

> > The stack traces look unusual.


	Lars



More information about the drbd-user mailing list