Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Oct 14, 2014 at 08:48:05PM +0200, Matteo Tescione wrote:
> Lars,
>
>
> > exact DRBD version?
> > in tree, out of tree?
> > kernel? patches applied? which ones?
> >
> > DRBD Configuration and IO stack?
> > What do you do to trigger it?
> > How easy can you reproduce?
> > Logs leading up to this?
>
> 8.4.4 downloaded from official repository, compiled out of tree.
> Kernel is 3.12.14
> with only scst req_fifo patch
I don't know about that one
> & iscsi_putpage_callback patch applied.
But last time I've seen this one,
it did not properly work together with DRBD.
Possibly causing potential memory corruption, even? Not sure.
Has this been revisited?
> I do nothing, the 2 box are interconnected with 40-gbit back-to-back
> infiniband qle7342, but I seen several disconnection event repeating
> random even with no IO, and noticed that once in a month the primary
> box crashes. So I turned on the console logging and captured the trace
> you see below.
>
> As I said, occurs only once in 30-40 days.
>
> Logs before this panic:
Do you like double newlines,
or is this mangled on my end only?
> drbd TOP: Restarting receiver thread
> drbd TOP: receiver (re)started
> drbd TOP: conn( Unconnected -> WFConnection )
> drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400)
> drbd TOP: fence-peer helper returned 4 (peer was fenced)
> drbd TOP: pdsk( DUnknown -> Outdated )
> drbd TOP: Handshake successful: Agreed network protocol version 101
> drbd TOP: Agreed to support TRIM on protocol level
> drbd TOP: Peer authenticated using 20 bytes HMAC
> drbd TOP: conn( WFConnection -> WFReportParams )
> drbd TOP: Starting asender thread (from drbd_r_TOP [4385])
> block drbd0: drbd_sync_handshake:
> block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0
> block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0
> block drbd0: uuid_compare()=1 by rule 70
> block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
> block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
> block drbd1: drbd_sync_handshake:
> block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0
> block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0
> block drbd1: uuid_compare()=1 by rule 70
> block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
> block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
> block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
> block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]).
> block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995
> block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
> block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
> block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]).
> block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB
> block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec)
> block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D
> block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec)
> block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7
> block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
> block drbd1: Remote failed to finish a request within ko-count * timeout
> block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> drbd TOP: asender terminated
> drbd TOP: Terminating drbd_a_TOP
> block drbd1: md_sync_timer expired! Worker calls drbd_md_sync().
> block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D
> drbd TOP: Connection closed
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP: mask = 0x1f0 val = 0x20
> drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected
> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
> drbd TOP: receiver terminated
> drbd TOP: Restarting receiver thread
> drbd TOP: receiver (re)started
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP: mask = 0x1f0 val = 0x80
> drbd TOP: old_conn:WFReportParams wanted_conn:WFConnection
> drbd TOP: Connection closed
> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
> block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
> drbd TOP: mask = 0x1f0 val = 0x20
> drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected
> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
> drbd TOP: receiver terminated
> drbd TOP: Terminating drbd_r_TOP
> block drbd0: Remote failed to finish a request within ko-count * timeout
> block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
Looks busy, but "normal" failure cases, so far.
> drbd TOP: Starting receiver thread (from swapper/0 [0])
That is strange, the receiver thread is supposed to be started from the
"drbd worker" thread?
> BUG: scheduling while atomic: swapper/0/0/0x00000100
Boom.
> > The stack traces look unusual.
Lars