[DRBD-user] strange drbd bug

Matteo Tescione matteo at rmnet.it
Wed Oct 15 12:02:09 CEST 2014

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


Hello,

Sorry for the double new lines but the trace is saved via serial console.

Forgot to say, I'm using disable_sendpage=1 in my module options.
Furthermore the drbd disconnect/reconnect happens random even with almost no IO, I guess when this happens with more IO it bugs...

Best Regards,
--
matteo

> Il giorno 14/ott/2014, alle ore 22:03, Lars Ellenberg <lars.ellenberg at linbit.com> ha scritto:
> 
>> On Tue, Oct 14, 2014 at 08:48:05PM +0200, Matteo Tescione wrote:
>> Lars,
>> 
>> 
>>> exact DRBD version?
>>> in tree, out of tree?
>>> kernel? patches applied? which ones?
>>> 
>>> DRBD Configuration and IO stack?
>>> What do you do to trigger it?
>>> How easy can you reproduce?
>>> Logs leading up to this?
>> 
>> 8.4.4 downloaded from official repository, compiled out of tree.
>> Kernel is 3.12.14
>> with only scst req_fifo patch
> 
> I don't know about that one
> 
>> & iscsi_putpage_callback patch applied.
> 
> But last time I've seen this one,
> it did not properly work together with DRBD.
> Possibly causing potential memory corruption, even? Not sure.
> Has this been revisited?
> 
>> I do nothing, the 2 box are interconnected with 40-gbit back-to-back
>> infiniband qle7342, but I seen several disconnection event repeating
>> random even with no IO, and noticed that once in a month the primary
>> box crashes. So I turned on the console logging and captured the trace
>> you see below.
>> 
>> As I said, occurs only once in 30-40 days.
>> 
>> Logs before this panic:
> 
> Do you like double newlines,
> or is this mangled on my end only?
> 
>> drbd TOP: Restarting receiver thread
>> drbd TOP: receiver (re)started
>> drbd TOP: conn( Unconnected -> WFConnection )
>> drbd TOP: helper command: /sbin/drbdadm fence-peer TOP exit code 4 (0x400)
>> drbd TOP: fence-peer helper returned 4 (peer was fenced)
>> drbd TOP: pdsk( DUnknown -> Outdated )
>> drbd TOP: Handshake successful: Agreed network protocol version 101
>> drbd TOP: Agreed to support TRIM on protocol level
>> drbd TOP: Peer authenticated using 20 bytes HMAC
>> drbd TOP: conn( WFConnection -> WFReportParams )
>> drbd TOP: Starting asender thread (from drbd_r_TOP [4385])
>> block drbd0: drbd_sync_handshake:
>> block drbd0: self 34B181D069424F63:3921A73508F919D7:673BF468C9830995:673AF468C9830995 bits:59293 flags:0
>> block drbd0: peer 3921A73508F919D6:0000000000000000:673BF468C9830994:673AF468C9830995 bits:0 flags:0
>> block drbd0: uuid_compare()=1 by rule 70
>> block drbd0: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
>> block drbd0: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
>> block drbd1: drbd_sync_handshake:
>> block drbd1: self D532468B0E14258D:344C01173D6B201D:659C03CE91D88EBB:659B03CE91D88EBB bits:44 flags:0
>> block drbd1: peer 344C01173D6B201C:0000000000000000:659C03CE91D88EBA:659B03CE91D88EBB bits:0 flags:0
>> block drbd1: uuid_compare()=1 by rule 70
>> block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Consistent )
>> block drbd1: send bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
>> block drbd0: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 12494(4), total 12494; compression: 100.0%
>> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0
>> block drbd0: helper command: /sbin/drbdadm before-resync-source minor-0 exit code 0 (0x0)
>> block drbd0: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
>> block drbd0: Began resync as SyncSource (will sync 237172 KB [59293 bits set]).
>> block drbd0: updated sync UUID 34B181D069424F63:3922A73508F919D7:3921A73508F919D7:673BF468C9830995
>> block drbd1: receive bitmap stats [Bytes(packets)]: plain 0(0), RLE 120(1), total 120; compression: 100.0%
>> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1
>> block drbd1: helper command: /sbin/drbdadm before-resync-source minor-1 exit code 0 (0x0)
>> block drbd1: conn( WFBitMapS -> SyncSource ) pdsk( Consistent -> Inconsistent )
>> block drbd1: Began resync as SyncSource (will sync 196 KB [49 bits set]).
>> block drbd1: updated sync UUID D532468B0E14258D:344D01173D6B201D:344C01173D6B201D:659C03CE91D88EBB
>> block drbd1: Resync done (total 1 sec; paused 0 sec; 196 K/sec)
>> block drbd1: updated UUIDs D532468B0E14258D:0000000000000000:344D01173D6B201D:344C01173D6B201D
>> block drbd1: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
>> block drbd0: Resync done (total 6 sec; paused 0 sec; 39528 K/sec)
>> block drbd0: updated UUIDs 34B181D069424F63:0000000000000000:3922A73508F919D7:3921A73508F919D7
>> block drbd0: conn( SyncSource -> Connected ) pdsk( Inconsistent -> UpToDate )
>> block drbd1: Remote failed to finish a request within ko-count * timeout
>> block drbd1: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
>> drbd TOP: asender terminated
>> drbd TOP: Terminating drbd_a_TOP
>> block drbd1: md_sync_timer expired! Worker calls drbd_md_sync().
>> block drbd1: new current UUID 3C68CD64811C1C41:D532468B0E14258D:344D01173D6B201D:344C01173D6B201D
>> drbd TOP: Connection closed
>> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
>> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
>> block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
>> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
>> drbd TOP:  mask = 0x1f0 val = 0x20
>> drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected
>> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
>> drbd TOP: receiver terminated
>> drbd TOP: Restarting receiver thread
>> drbd TOP: receiver (re)started
>> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
>> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
>> block drbd0:  wanted = { cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
>> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
>> drbd TOP:  mask = 0x1f0 val = 0x80
>> drbd TOP:  old_conn:WFReportParams wanted_conn:WFConnection
>> drbd TOP: Connection closed
>> block drbd0: State change failed: Refusing to be Primary while peer is not outdated
>> block drbd0:   state = { cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate r----- }
>> block drbd0:  wanted = { cs:Unconnected ro:Primary/Unknown ds:UpToDate/DUnknown r----- }
>> drbd TOP: State change failed: Refusing to be Primary while peer is not outdated
>> drbd TOP:  mask = 0x1f0 val = 0x20
>> drbd TOP:  old_conn:WFReportParams wanted_conn:Unconnected
>> drbd TOP: Expected cstate < C_WF_REPORT_PARAMS
>> drbd TOP: receiver terminated
>> drbd TOP: Terminating drbd_r_TOP
>> block drbd0: Remote failed to finish a request within ko-count * timeout
>> block drbd0: peer( Secondary -> Unknown ) conn( Connected -> Timeout ) pdsk( UpToDate -> DUnknown )
> 
> 
> Looks busy, but "normal" failure cases, so far.
> 
>> drbd TOP: Starting receiver thread (from swapper/0 [0])
> 
> That is strange, the receiver thread is supposed to be started from the
> "drbd worker" thread?
> 
>> BUG: scheduling while atomic: swapper/0/0/0x00000100
> 
> Boom.
> 
>>> The stack traces look unusual.
> 
> 
>    Lars



More information about the drbd-user mailing list