[DRBD-user] drbd logic bug?

Lars Ellenberg lars.ellenberg at linbit.com
Fri Oct 12 00:31:27 CEST 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Oct 11, 2007 at 03:10:47PM -0700, Kelsey Cummings wrote:
> On Fri, Oct 12, 2007 at 12:00:04AM +0200, Lars Ellenberg wrote:
> ...
> > > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771449 > rs_total=416104 (rs_failed 0)
> > > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771447 > rs_total=416104 (rs_failed 0)
> > > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=771446 > rs_total=416104 (rs_failed 0)
> > > Oct 11 13:09:09 a kernel: drbd1: logic bug? rs_left=768609 > rs_total=416104 (rs_failed 0)
> > > Oct 11 13:09:11 a kernel: drbd1: logic bug? rs_left=762068 > rs_total=416104 (rs_failed 0)
> > 
> > now, that is interessting...
> > probably indeed a "logic bug", in that we don't get the accounting
> > right. I believe thought that drbd does the "right thing" still,
> > and only gets the counter wrong for some reason.  we'll have a look.
> 
> I left the cron'd reboots in place and it seems to have done the right
> thing as you say.  Tests are still running fine and the filesystems 
> appear to be intact even though they have swapped to and from the other
> node.
> 
> However, I'm now seeing the b.node with similar errors and the status
> reports that it is stalled.
> 
> on b
> # cat /proc/drbd 
> version: 8.0.6 (api:86/proto:86)
> SVN Revision: 3048 build by root at a.store, 2007-09-10 15:25:56
>  0: cs:Connected st:Primary/Secondary ds:UpToDate/UpToDate C r---
>     ns:4542096 nr:54038340 dw:58580464 dr:2043989 al:1135 bm:746 lo:0 pe:0 ua:0 ap:0
>         resync: used:0/31 hits:241212 misses:264 starving:0 dirty:0 changed:264
>         act_log: used:0/521 hits:1134396 misses:1135 starving:0 dirty:0 changed:1135
>  1: cs:SyncSource st:Primary/Secondary ds:UpToDate/Inconsistent C r---
>     ns:5878776 nr:58098092 dw:63336632 dr:3669653 al:1307 bm:841 lo:0 pe:352 ua:0 ap:352
>         [================>...] sync'ed: 84.7% (544436/3528760)K
>         stalled
>         resync: used:0/31 hits:221339 misses:273 starving:0 dirty:0 changed:273
>         act_log: used:2/521 hits:1308328 misses:1615 starving:0 dirty:308 changed:1307


in the "resync process" in drbd, the SyncTarget "knows" what blocks it
needs, and requests them. once it is done, it goes "Connected".
the SyncSource starts "knowing" also, but then during the "pause",
aparently loses track of the "cleared by application write" bits,
leading to it expecting more syncer requests from the peer.

you should be able to get ou of this state by simply doing
"disconnect; reconnect".

> on a
> version: 8.0.6 (api:86/proto:86)
> SVN Revision: 3048 build by root at a.store.usenetbinaries.com, 2007-09-10 15:25:56
>  0: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>     ns:0 nr:4970524 dw:4970524 dr:0 al:0 bm:126 lo:0 pe:0 ua:0 ap:0
>         resync: used:0/31 hits:121070 misses:126 starving:0 dirty:0 changed:126
>         act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
>  1: cs:Connected st:Secondary/Primary ds:UpToDate/UpToDate C r---
>     ns:0 nr:6246352 dw:6246352 dr:0 al:0 bm:273 lo:0 pe:0 ua:0 ap:0
>         resync: used:0/31 hits:221339 misses:273 starving:0 dirty:0 changed:273
>         act_log: used:0/521 hits:0 misses:0 starving:0 dirty:0 changed:0
> 
> ...
> drbd1: conn( WFConnection -> WFReportParams ) 
> drbd1: Handshake successful: DRBD Network Protocol version 86
> drbd1: Peer authenticated using 20 bytes of 'sha1' HMAC
> drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) 
> drbd1: Writing meta data super block now.
> drbd1: aftr_isp( 0 -> 1 ) 
> drbd1: conn( WFBitMapS -> PausedSyncS ) pdsk( Outdated -> Inconsistent ) 
> drbd1: Began resync as PausedSyncS (will sync 3528760 KB [882190 bits set]).
> drbd1: Writing meta data super block now.
> drbd1: pdsk( Inconsistent -> Outdated ) peer_isp( 0 -> 1 ) 

> drbd1: lost connection while grabbing the req_lock!
^^^^^
this one is probably the interesting one,
and where the "logic bug" is in.

> drbd1: pdsk( Outdated -> Inconsistent ) peer_isp( 1 -> 0 ) 
> drbd1: conn( PausedSyncS -> SyncSource ) aftr_isp( 1 -> 0 ) 
> drbd1: Syncer continues.
> drbd1: logic bug? rs_left=1015632 > rs_total=882190 (rs_failed 0)
> drbd1: logic bug? rs_left=1015602 > rs_total=882190 (rs_failed 0)
> drbd1: logic bug? rs_left=1015574 > rs_total=882190 (rs_failed 0)
> drbd1: logic bug? rs_left=1015566 > rs_total=882190 (rs_failed 0)
> drbd1: logic bug? rs_left=1015557 > rs_total=882190 (rs_failed 0)
> ...
> 
> Let me know if there is any things else I can do to help squash this.

will do, thanks.

-- 
: commercial DRBD/HA support and consulting: sales at linbit.com :
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.



More information about the drbd-user mailing list