Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
On Tue, Oct 14, 2014 at 08:48:05PM +0200, Matteo Tescione wrote: > Lars, > > > > exact DRBD version? > > in tree, out of tree? > > kernel? patches applied? which ones? > > > > DRBD Configuration and IO stack? > > What do you do to trigger it? > > How easy can you reproduce? > > Logs leading up to this? > > 8.4.4 downloaded from official repository, compiled out of tree. > Kernel is 3.12.14 > with only scst req_fifo patch I don't know about that one > & iscsi_putpage_callback patch applied. But last time I've seen this one, it did not properly work together with DRBD. Possibly causing potential memory corruption, even? Not sure. Has this been revisited? > I do nothing, the 2 box are interconnected with 40-gbit back-to-back > infiniband qle7342, but I seen several disconnection event repeating > random even with no IO, and noticed that once in a month the primary > box crashes. So I turned on the console logging and captured the trace > you see below. > > As I said, occurs only once in 30-40 days. > > Logs before this panic: Do you like double newlines, or is this mangled on my end only? > drbd TOP: Restarting receiver thread > drbd TOP: receiver (re)started > drbd TOP: conn( Unconnected -> WFConnection ) > drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400) > drbd TOP: fence-peer helper returned 4 (peer was fenced) > drbd TOP: pdsk( DUnknown -> Outdated ) > drbd TOP: Handshake successful: Agreed network protocol version 101 > drbd TOP: Agreed to support TRIM on protocol level > drbd TOP: Peer authenticated using 20 bytes HMAC > drbd TOP: conn( WFConnection -> WFReportParams ) > drbd TOP: Starting asender thread (from drbd_r_TOP [4385]) > block drbd0: drbd_sync_handshake: > block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0 > block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0 > block drbd0: uuid_compare()=1 by rule 70 > block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) > block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% > block drbd1: drbd_sync_handshake: > block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0 > block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0 > block drbd1: uuid_compare()=1 by rule 70 > block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) > block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% > block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% > block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 > block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) > block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) > block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]). > block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995 > block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% > block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 > block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) > block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) > block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]). > block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB > block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec) > block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D > block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) > block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec) > block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7 > block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) > block drbd1: Remote failed to finish a request within ko-count * timeout > block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) > drbd TOP: asender terminated > drbd TOP: Terminating drbd_a_TOP > block drbd1: md_sync_timer expired! Worker calls drbd_md_sync(). > block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D > drbd TOP: Connection closed > block drbd0: State change failed: Refusing to be Primary while peer is not outdated > block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } > block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } > drbd TOP: State change failed: Refusing to be Primary while peer is not outdated > drbd TOP: mask = 0x1f0 val = 0x20 > drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected > drbd TOP: Expected cstate < C_WF_REPORT_PARAMS > drbd TOP: receiver terminated > drbd TOP: Restarting receiver thread > drbd TOP: receiver (re)started > block drbd0: State change failed: Refusing to be Primary while peer is not outdated > block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } > block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- } > drbd TOP: State change failed: Refusing to be Primary while peer is not outdated > drbd TOP: mask = 0x1f0 val = 0x80 > drbd TOP: old_conn:WFReportParams wanted_conn:WFConnection > drbd TOP: Connection closed > block drbd0: State change failed: Refusing to be Primary while peer is not outdated > block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } > block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } > drbd TOP: State change failed: Refusing to be Primary while peer is not outdated > drbd TOP: mask = 0x1f0 val = 0x20 > drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected > drbd TOP: Expected cstate < C_WF_REPORT_PARAMS > drbd TOP: receiver terminated > drbd TOP: Terminating drbd_r_TOP > block drbd0: Remote failed to finish a request within ko-count * timeout > block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) Looks busy, but "normal" failure cases, so far. > drbd TOP: Starting receiver thread (from swapper/0 [0]) That is strange, the receiver thread is supposed to be started from the "drbd worker" thread? > BUG: scheduling while atomic: swapper/0/0/0x00000100 Boom. > > The stack traces look unusual. Lars