Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.
Hello, Sorry for the double new lines but the trace is saved via serial console. Forgot to say, I'm using disable_sendpage=1 in my module options. Furthermore the drbd disconnect/reconnect happens random even with almost no IO, I guess when this happens with more IO it bugs... Best Regards, -- matteo > Il giorno 14/ott/2014, alle ore 22:03, Lars Ellenberg <lars.ellenberg at linbit.com> ha scritto: > >> On Tue, Oct 14, 2014 at 08:48:05PM +0200, Matteo Tescione wrote: >> Lars, >> >> >>> exact DRBD version? >>> in tree, out of tree? >>> kernel? patches applied? which ones? >>> >>> DRBD Configuration and IO stack? >>> What do you do to trigger it? >>> How easy can you reproduce? >>> Logs leading up to this? >> >> 8.4.4 downloaded from official repository, compiled out of tree. >> Kernel is 3.12.14 >> with only scst req_fifo patch > > I don't know about that one > >> & iscsi_putpage_callback patch applied. > > But last time I've seen this one, > it did not properly work together with DRBD. > Possibly causing potential memory corruption, even? Not sure. > Has this been revisited? > >> I do nothing, the 2 box are interconnected with 40-gbit back-to-back >> infiniband qle7342, but I seen several disconnection event repeating >> random even with no IO, and noticed that once in a month the primary >> box crashes. So I turned on the console logging and captured the trace >> you see below. >> >> As I said, occurs only once in 30-40 days. >> >> Logs before this panic: > > Do you like double newlines, > or is this mangled on my end only? > >> drbd TOP: Restarting receiver thread >> drbd TOP: receiver (re)started >> drbd TOP: conn( Unconnected -> WFConnection ) >> drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400) >> drbd TOP: fence-peer helper returned 4 (peer was fenced) >> drbd TOP: pdsk( DUnknown -> Outdated ) >> drbd TOP: Handshake successful: Agreed network protocol version 101 >> drbd TOP: Agreed to support TRIM on protocol level >> drbd TOP: Peer authenticated using 20 bytes HMAC >> drbd TOP: conn( WFConnection -> WFReportParams ) >> drbd TOP: Starting asender thread (from drbd_r_TOP [4385]) >> block drbd0: drbd_sync_handshake: >> block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0 >> block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0 >> block drbd0: uuid_compare()=1 by rule 70 >> block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) >> block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% >> block drbd1: drbd_sync_handshake: >> block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0 >> block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0 >> block drbd1: uuid_compare()=1 by rule 70 >> block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent ) >> block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% >> block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0% >> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 >> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0) >> block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) >> block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]). >> block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995 >> block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0% >> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 >> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0) >> block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent ) >> block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]). >> block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB >> block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec) >> block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D >> block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) >> block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec) >> block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7 >> block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate ) >> block drbd1: Remote failed to finish a request within ko-count * timeout >> block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) >> drbd TOP: asender terminated >> drbd TOP: Terminating drbd_a_TOP >> block drbd1: md_sync_timer expired! Worker calls drbd_md_sync(). >> block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D >> drbd TOP: Connection closed >> block drbd0: State change failed: Refusing to be Primary while peer is not outdated >> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } >> block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } >> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated >> drbd TOP: mask = 0x1f0 val = 0x20 >> drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected >> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS >> drbd TOP: receiver terminated >> drbd TOP: Restarting receiver thread >> drbd TOP: receiver (re)started >> block drbd0: State change failed: Refusing to be Primary while peer is not outdated >> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } >> block drbd0: wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- } >> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated >> drbd TOP: mask = 0x1f0 val = 0x80 >> drbd TOP: old_conn:WFReportParams wanted_conn:WFConnection >> drbd TOP: Connection closed >> block drbd0: State change failed: Refusing to be Primary while peer is not outdated >> block drbd0: state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- } >> block drbd0: wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- } >> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated >> drbd TOP: mask = 0x1f0 val = 0x20 >> drbd TOP: old_conn:WFReportParams wanted_conn:Unconnected >> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS >> drbd TOP: receiver terminated >> drbd TOP: Terminating drbd_r_TOP >> block drbd0: Remote failed to finish a request within ko-count * timeout >> block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown ) > > > Looks busy, but "normal" failure cases, so far. > >> drbd TOP: Starting receiver thread (from swapper/0 [0]) > > That is strange, the receiver thread is supposed to be started from the > "drbd worker" thread? > >> BUG: scheduling while atomic: swapper/0/0/0x00000100 > > Boom. > >>> The stack traces look unusual. > > > Lars