[Drbd-dev] drbd 8.4.3: refcounter overflow on re-sync

Wed Sep 24 14:50:22 CEST 2014

On Wed, Sep 24, 2014 at 12:14:51PM +0200, Lars Ellenberg wrote:
> On Tue, Sep 23, 2014 at 08:14:21PM +0200, Marc Schiffbauer wrote:
> > * Lars Ellenberg schrieb am 23.09.14 um 13:03 Uhr:
> > >On Fri, Sep 19, 2014 at 05:16:53PM +0200, Marc Schiffbauer wrote:
> > >>* Lars Ellenberg schrieb am 19.09.14 um 16:48 Uhr:
> > >>>On Fri, Sep 19, 2014 at 11:49:09AM +0200, Marc Schiffbauer wrote:
> > >>>>Hi,
> > >>>>
> > >>>
> > >>>If you resolve that to a code line,
> > >>>I may be able to figure out what PAX is talking about.
> > >>>
> > >>>But from this stack trace alone, I have absolutely no idea what PAX
> > >>>is trying to say, which refcount could possibly be meant there,
> > >>>let alone why it could possibly overflow.
> > >>>
> > >>>Ah, ok. Looking at [1], "PaX Team" says:
> > >>>.---
> > >>>| after having looked at the drbd code a bit i think this could be a
> > >>>| real bug in drbd but only upstream can tell for sure so you'll have to
> > >>>| contact them. you can show them the following that i figured out so far:
> > >>>|
> > >>>| the refcount overflow was detected in
> > >>>| drivers/block/drbd/drbd_bitmap.c:bm_page_io_async at the
> > >>>|
> > >>>| atomic_add(len >> 9, &mdev->rs_sect_ev)
> > >>>
> > >>>Well, yes, why would it not overflow.
> > >>>It is *not* a refcount.
> > >>>It is an atomic counter.
> > >>>It is meant to overflow.
> > 
> > 
> > Another question PaX-Team is asking:
> > 
> > what about rs_sect_in?
> 
> That usually should not overflow, as it is typically reset to zero
> regularly (several times per second), and for other reasons as well.
> 
> If you manage to transfer more than 2 TiByte in subseconds via a single
> TCP connection, more power to you.
> 
> Still, if it should overflow (for whatever reason), no real harm done.
> Arbitrarily sending a signal or terminating processes in that case would
> be the only actually disturbing thing.

Ok.
So what PAX really is doing is redefining "atomic_add" and similar to
basically become a no-op if the operation would overflow.

typedef struct { int counter; } atomic_t;
void atomic_add(int i, atomic_t *v)
{
	v->counter += i;
	if (that_caused_a_counter_wrap_in_any_direction) {
		/* oops, overflow */
		SCREAM("help me, overflow...");
		v->counter -= i;
	}
}

If that *is* really an object refcount,
and somewhere would be
	if (atomic_dec_and_test(that_count))
		free(some_object);
then ok, you have replaced one bug
with an error message and a different bug.
Might help with debugging.  Not with much else.
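
For a counter that is not a refcount, the difference between wrapping
and refusing the add can be sketched roughly as below. This is only an
illustration: counter_t, add_wrapping, add_refusing and delta are
hypothetical names, not kernel or PaX code; __builtin_add_overflow
assumes GCC or Clang; and the idea that the consumer only looks at the
difference between two snapshots of the counter is an assumption made
for the example, not something taken from the DRBD source.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical stand-in for illustration; not the kernel's atomic_t. */
typedef struct { int counter; } counter_t;

/* Wrapping add: do the arithmetic in unsigned, where wraparound is
 * well defined.  Converting back to int is implementation-defined in
 * ISO C; GCC and Clang reduce modulo 2^N. */
static void add_wrapping(int i, counter_t *v)
{
	v->counter = (int)((unsigned int)v->counter + (unsigned int)i);
}

/* Refusing add: an add that would overflow is reported and dropped,
 * leaving the counter unchanged.  Returns 1 if the add was dropped. */
static int add_refusing(int i, counter_t *v)
{
	int sum;

	if (__builtin_add_overflow(v->counter, i, &sum)) {
		fprintf(stderr, "overflow detected, add dropped\n");
		return 1;
	}
	v->counter = sum;
	return 0;
}

/* Events counted since an earlier snapshot, computed modulo 2^32
 * (assuming a 32-bit int). */
static unsigned int delta(const counter_t *v, int snapshot)
{
	return (unsigned int)v->counter - (unsigned int)snapshot;
}
```

Starting both variants at INT_MAX - 10 and adding 100, delta() against
the starting value gives 100 for the wrapping counter and 0 for the
refusing one: the wrapped counter still yields the right difference,
while the refused add simply loses the events.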

But really.
Changing (x + y), as a precaution, to be silently identical to (x + 0),
"just in case", will surely generally improve program flow...  D'oh.

Anyways, now that I know PAX is really just keeping that counter
at a fixed value of INT_MAX in this case, and nothing else,
what would have caused DRBD to disconnect/reconnect?

Could that have been you?

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.