[DRBD-user] mount secondary read-write

Lars Ellenberg lars.ellenberg at linbit.com
Fri Jun 29 00:10:33 CEST 2007

Note: "permalinks" may not be as permanent as we would like,
direct links of old sources may well be a few messages off.


On Thu, Jun 28, 2007 at 09:58:36PM +0200, H.D. wrote:
> Lars Ellenberg wrote:
> > whether or not this is noticeable depends very much on the actual
> > workload, and on the lower-level I/O subsystem.
> 
> Right, but you always have to do backups, and backups always cost
> performance. I think one should plan systems with that in mind, and
> with enough headroom, from the beginning.

agreed!

> As the secondary box is usually less loaded than the primary, I prefer
> to do it there.

I did not mean to dismiss your recommendation :)

we do that as well, and it has proven very useful,
especially if you have a select-intensive (mainly CPU-bound)
application running...
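
for reference, roughly what such a secondary-side snapshot backup can
look like. this is only a sketch: it assumes drbd runs on top of lvm
(so the backing LV can be snapshotted while drbd stays secondary), and
the vg/lv names, mount point, backup target and the xfs-only mount
options are all made up:

  # take a snapshot of the drbd backing LV on the secondary
  lvcreate --snapshot --size 15G --name pg-snap /dev/vg0/pg-backing

  # mount it read-only; norecovery/nouuid are xfs-specific
  mount -o ro,norecovery,nouuid /dev/vg0/pg-snap /mnt/backup

  # run the backup off the snapshot, then clean up
  tar -C /mnt/backup -czf /var/backups/pgdata.tar.gz .
  umount /mnt/backup
  lvremove -f /dev/vg0/pg-snap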

it's just that "the primary won't even notice"
may lead to wrong expectations,
the non-fulfilment of which would then be blamed on drbd.
so I thought I'd mention the caveats.

> We run PostgreSQL under heavy OLTP load and drop back from ~1500 to
> ~1150 TPS (test case, pgbench) while the secondary creates the backup.
> So yes, there sure is a big performance penalty, but as we have more
> than enough headroom left, I have not bothered yet to move the snaps to
> another PV.

what is your backend storage?
kernel?
lvm version?
what does "lvs -o +devices --segments" say?
filesystem?
RAM?
cpus?
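
(on the secondary, something along these lines shows which PVs back the
origin and the snapshot's copy-on-write space; if both list the same
device, every CoW write competes with the origin on the same spindles.
the volume group name below is an assumption:)

  # the devices column shows the PV(s) behind each LV segment
  lvs -o +devices --segments
  # or narrowed down to the volume group in question
  lvs -o lv_name,segtype,devices --segments vg0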

I've seen the case where the primary would even start the infamous
"ko counts", and then "connection cycles".
this was on 2.6.16-43, lvm 2.02.05, on md raid1,
2x 250G SATA drives, 120GB origin, 15G snap,
xfs, 24G RAM, 4-way amd opteron.
yes, apparently the io subsystem was underpowered :)
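
(for the record, the "ko count" comes from the ko-count option in the
net section of drbd.conf: if the peer fails to complete a single write
request within ko-count times the timeout, the primary drops the
connection. a quick way to see what you are running with; the resource
name r0 below is an assumption:)

  # dump the effective config and look at the net section (ko-count, timeout)
  drbdadm dump r0
  # connection state (Connected, NetworkFailure, WFConnection, ...) lives here
  cat /proc/drbd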

most of the time it "just worked" fine.
sometimes the snap even filled up considerably,
since we also run reports there, and they can take a very long time.

BUT when the primary has e.g. some auto_vacuum active,
or otherwise updates its disk madly,
the secondary's io subsystem apparently
starts thrashing so badly that drbd gets the hiccups...

I've even seen the secondary stay in "NetworkFailure" for ~20 minutes,
after the primary dropped the connection because of the "ko count",
and that is supposed to be a transitory state which normally resolves to
either Unconnected or WFConnection within a fraction of a second.
note, however, that even then it recovers and syncs up,
unless of course something else went wrong during the degraded phase.

in short: in my experience, having the snapshot
on the same PV as the origin can cause serious pain.
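
if you ever do want to move the snapshot off the origin's PV: lvcreate
accepts a list of PVs to allocate from at the end of the command line.
device and lv names below are made up:

  # put the snapshot's CoW space on a different PV than the origin
  lvcreate --snapshot --size 15G --name pg-snap /dev/vg0/pg-backing /dev/sdc1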

ymmv

 :)

-- 
: Lars Ellenberg                            Tel +43-1-8178292-0  :
: LINBIT Information Technologies GmbH      Fax +43-1-8178292-82 :
: Vivenotgasse 48, A-1120 Vienna/Europe    http://www.linbit.com :
__
please use the "List-Reply" function of your email client.


