Subject Re: Batch insert performance
Author Adam
--- In, Richard Wesley <hawkfish@...> wrote:
> Hey Adam -
> On Aug 23, 2006, at 16:56, Adam wrote:
> > Just doing some math on your feedback, each of your rows is at least
> > 625 bytes, and I am not sure whether you have counted the 2 bytes
> > overhead on each varchar field (which would be an additional 40
> > bytes), or whether you are using some multi-byte characterset. Each
> > bigint, double and timestamp is another 8 bytes, and there are other
> > pieces of information that must be stored against each record for
> > Firebird internals.
> I hadn't done the precise math (because this is a general purpose
> application), but this brings up another thing I was wondering
> about. In another thread there was a discussion of the optimal page
> size for OLAP being the maximum page size (at least for Oracle).
> Does anyone know whether that is a reasonable idiom for FB? We do a
> fair bit of aggregation (e.g. SUM) in some of our queries and that
> seems to bog FB down sometimes.

Well, your row size is over 0.5K, probably approaching 1K. This means
you will only get a couple of records per data page, and new record
versions will also be competing for space on the same page.
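To put rough numbers on "a couple of records per page", here is a back-of-envelope sketch. The overhead constants are assumptions for illustration, not exact Firebird internals, which vary by version:

```python
# Rough records-per-page estimate for a wide row at various Firebird page sizes.
# PAGE_OVERHEAD and RECORD_OVERHEAD are assumed values, not exact internals.
PAGE_OVERHEAD = 32      # assumed bytes of page header/bookkeeping
RECORD_OVERHEAD = 17    # assumed bytes of record header per stored row

def records_per_page(row_bytes, page_size):
    usable = page_size - PAGE_OVERHEAD
    return usable // (row_bytes + RECORD_OVERHEAD)

for page_size in (4096, 8192, 16384):
    print(page_size, records_per_page(700, page_size))  # 5, 11, 22
```

And remember that back versions of updated records also want space on the same page, so the effective number of current records per page is lower still.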

I don't know of any simple rule for calculating page size; if there
were one, there would probably be a tool that could examine a database
and then backup-restore it using the ideal size. The proof is in the
pudding.

A larger page size means more data per page, and therefore fewer round
trips to the disk. It also makes the cache effectively larger (the
cache is measured in pages, so unless you reduce the number of pages,
it will hold more data). The most efficient way to find the ideal page
size is simply to change it and repeat the experiment.

Perhaps you can break the extremely wide table apart: use an attribute
table to store the 'field names' of the wide table, and a second table
holding the value for each combination of attribute and wide-table
primary key.
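A minimal sketch of that attribute/value reshaping (the field names and ids below are invented for illustration; note the trade-off that aggregation queries become more verbose against this layout):

```python
# Pivot a wide record into (row_id, attribute_id, value) triples, the shape
# suggested above. Attribute names are stored once in a lookup table; the
# value rows reference them by id.
def wide_to_eav(row_id, record, attr_ids):
    """record: field name -> value; attr_ids: field name -> attribute table pk."""
    return [(row_id, attr_ids[name], value) for name, value in record.items()]

attr_ids = {"price": 1, "quantity": 2}   # contents of the attribute table
triples = wide_to_eav(42, {"price": 9.99, "quantity": 3}, attr_ids)
print(triples)   # [(42, 1, 9.99), (42, 2, 3)]
```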

I am pretty sure that Sean advised the other day to not apply Oracle
configuration page size logic to a Firebird database.

> > If you had only 4 bigint fields, your record size would be 32 bytes
> > (plus record overhead), in other words you are looking at a factor of
> > 10-20x storage overhead over the simple table with 4 bigints. That
> > would take you up to 2.5 - 5K records per second, which depending on
> > the hardware and configuration may or may not be reasonable.
> >
> > I am not suggesting that the performance is good or acceptable, nor am
> > I suggesting that it cannot be improved by batching, just pointing
> > out some facts so that your expectations are reasonable.
> >
> >>
> >> The actual layout of the data is not so important as I am looking for
> >> a general solution - not just one for this customer. We have
> >> Firebird embedded in our product for use as a local SQL OLAP engine
> >> and our typical usage scenario is to take a sample of a user's
> >> completely arbitrary database and load it into a local FB database
> >> for off-line analysis.
> >>
> >>>> Would
> >>>> batching the inserts improve things?
> >>>
> >>> How often are you committing the rows?
> >>
> >> Only at the end. I was wondering if updating the frequency would
> >> improve things. TFB mentions batching in the range 5-20k rows when
> >> you can't avoid indexing and I was wondering if there were other
> >> cases where this would apply?
> >
> > Firebird maintains an undo log in memory in case you want to rollback.
> > Don't quote me on it, but I think after about 5K operations the server
> > decides this log is too large and abandons it. Given that this is an
> > 'offline process' and you probably don't have to worry about atomicity
> > so much, you can probably manipulate your code to commit every so
> > often.
> This is sounding like something worth trying. Do I have to re-
> prepare a statement if I call CommitRetaining?
> > One thing you might try which may buy you better speed is to insert
> > the data into an 'external table', then run an insert into select from
> > query to pump it into the real table. This will avoid the round trips
> > at least.
> Since this is an embedded application, I'm not sure that the process
> through the API would be much slower than FB using its own API
> internally. But it is an interesting point and has me wondering
> about my own internal overhead which involves converting utf-16 to
> utf-8 for each string.
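On committing every so often: the loop looks roughly like this. I have used sqlite3 as a stand-in only because it ships with Python; with a Firebird driver the shape is the same, and the batch size of 5000 is an assumption to tune by experiment, not a measured optimum:

```python
# Sketch of batched commits during a bulk load, to keep the in-memory undo
# log from growing without bound. sqlite3 stands in for a Firebird driver.
import sqlite3

BATCH = 5000  # assumed batch size; tune by experiment

def bulk_load(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
    cur = con.cursor()
    for i, row in enumerate(rows, start=1):
        cur.execute("INSERT INTO t VALUES (?, ?)", row)
        if i % BATCH == 0:
            con.commit()    # release the undo log every BATCH rows
    con.commit()            # commit the final partial batch
    return con

con = bulk_load([(n, n * n) for n in range(12000)])
print(con.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 12000
```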

It would be useful to measure the speed of this conversion. The rule
in optimisation is to always work on the inefficiency that is costing
the most performance, until performance is acceptable.
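For instance, a quick measurement of the conversion cost might look like this (the string contents and counts are arbitrary; substitute a sample of your real data):

```python
# Time UTF-16 -> UTF-8 re-encoding over many strings, to see whether the
# conversion is worth optimising at all before touching anything else.
import time

strings = ["some sample text \u00e9\u00fc" * 10] * 100_000
utf16 = [s.encode("utf-16") for s in strings]

start = time.perf_counter()
utf8 = [b.decode("utf-16").encode("utf-8") for b in utf16]
elapsed = time.perf_counter() - start
print(f"{len(utf16)} conversions in {elapsed:.3f}s")
```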

Also, don't lose sight of the price of hardware. It is (generally
speaking) not worth spending weeks of work to deliver something that
could have been delivered immediately with a small RAM upgrade. Of
course, increased efficiency also improves scalability, so it is never
wasted effort.


PS. Sorry if this is a repost, Yahoo Groups has been quite unreliable
this week, unavailable half the time and delaying message posts for up
to a day.