Subject RE: [firebird-support] OLAP tuning in Firebird 2.1
Author Svein Erling Tysvær
Processing 600,000 records is always going to take a bit of time (not necessarily too long, but it will not be instant). I'd say the main problem with your query is that there is no WHERE clause to limit the number of records that need to be examined.

If there are lots of duplicates, I suppose you could add another table to hold the unique values and populate it through a trigger (inserting a record only when the value does not already exist). Then you could try something like:

SELECT DISTINCT tu."Object Source" AS "none:Object Source:nk"
FROM "TableauUnique" tu
WHERE EXISTS (SELECT * FROM "TableauExtract" te WHERE te."Object Source" = tu."Object Source")
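
A minimal sketch of such a trigger (the VARCHAR(100) type for "Object
Source" is an assumption - adjust it to match the real column):

CREATE TABLE "TableauUnique" (
  "Object Source" VARCHAR(100) NOT NULL PRIMARY KEY
);

SET TERM ^ ;
CREATE TRIGGER "TE_UNIQUE_AI" FOR "TableauExtract"
ACTIVE AFTER INSERT POSITION 0
AS
BEGIN
  /* Record the value only if we have not seen it before */
  IF (NOT EXISTS (SELECT 1 FROM "TableauUnique" u
                  WHERE u."Object Source" = NEW."Object Source")) THEN
    INSERT INTO "TableauUnique" ("Object Source")
    VALUES (NEW."Object Source");
END^
SET TERM ; ^

(Note that under concurrent inserts two transactions can both pass the
EXISTS test; one of them will then get a primary key violation, which
production code would need to catch and ignore.)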

That said, I know nothing about OLAP, nor whether "TableauExtract"."Object Source" actually contains lots of duplicates, so my answer may or may not be useful.

Set

-----Original Message-----
From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com] On Behalf Of Richard Wesley
Sent: 19 January 2009 22:23
To: firebird-support@yahoogroups.com
Subject: Re: [firebird-support] OLAP tuning in Firebird 2.1

Hi Helen -

I forgot to mention that the table we are testing here has 600,000
rows and the dimension requested has cardinality 4. It is one of the
test data sets we have constructed that is wide but not long.

On 19 Jan 2009, at 12:14, Helen Borrie wrote:

> At 05:53 AM 20/01/2009, you wrote:
>
>> _tidx_128_30a INDEX ON TableauExtract(Object Source)
>>
>> So, here is the first question:
>>
>> -----
>> When I connect to the extract and drag [Object Source] to rows, we
>> run:
>>
>> Query 1: Trying to get the domain of [Object Source] by GROUP BY
>> SELECT ("TableauExtract"."Object Source") AS "none:Object Source:nk"
>> FROM "TableauExtract"
>> GROUP BY 1
>> </QUERY>
>> <PLAN>
>> PLAN (TableauExtract ORDER _tidx_128_30a)
>> </PLAN>
>> [Time] Running the query took 4.2705 sec.
>> [Time] Getting the records took 0.0002 sec.
>>
>> It takes roughly 4 seconds to table scan this data source, so
>> judging by the time it appears that this query is doing a table
>> scan rather than retrieving the unique keys from the index.
>
> No. What it's doing here is constructing the whole set, which is a
> derived field (derived from the expression AS "none:Object
> Source:nk") and sorting that. In so doing, it recalculates the
> whole set and uses the index to assist in sorting that.

It sounds to me like you are saying that the AS clause makes the query
less efficient? If I leave it off, the query takes the same amount of
time, so this must not be what you mean.

>> That makes sense to me when
>> we include a measure, such as in the following, since computing the
>> measure will require touching all the records. But in the above, we
>> are only operating on the key of the index, so why touch the data pages?
>
> Because the output of the *retrieval* is expression output, not
> indexed data. Sorts operate on that output. Here you are giving
> the query a double-whammy.

So, to answer this request, Firebird retrieves the field from all
rows, sorts all these values, groups them and removes duplicates? I
would still expect the time here to be in the disk reads, not the sort
- which is confirmed by timing select count(*). (FWIW, the test
server is only used for performance testing and the disk cache is
warmed up.)
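
(To be explicit, the baseline I am timing against is a bare count over
the same table, which also has to read every data page but does no
sort:

SELECT COUNT(*) FROM "TableauExtract"

Both take about the same 4 seconds, which is why I read the cost as
disk reads rather than sorting.)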

>> So the bottom line is: What is the fastest way to determine the
>> members of a column in a local database? Is there any way (e.g. by
>> fiddling with transaction settings) to avoid a table scan?
>
> Do queries that can be optimized by indexes. Amongst others, that
> means not trying to do sorts (including DISTINCTs) on the output of
> expressions.


But suppose we have a 10-million-row de-normalised table with an
indexed 10-member dimension. Can we get the list of dimension members
without doing a table scan or a GROUP BY/DISTINCT? It sounds like the
answer is "no". (Which is fine - we are just trying to understand our
problem space.)
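
(One possible workaround we may experiment with - sketched here under
the assumptions that "Object Source" is a VARCHAR(100) and that the
optimizer turns each MIN() into an index probe, which would need to be
verified against the plan - is to emulate a "loose index scan" by
walking the index with successive MIN() calls:

SET TERM ^ ;
EXECUTE BLOCK RETURNS (member VARCHAR(100))
AS
BEGIN
  /* Seed with the smallest key, then repeatedly fetch the next
     larger one; runs cardinality+1 probe queries in total */
  SELECT MIN("Object Source") FROM "TableauExtract" INTO :member;
  WHILE (member IS NOT NULL) DO
  BEGIN
    SUSPEND;
    SELECT MIN("Object Source") FROM "TableauExtract"
     WHERE "Object Source" > :member
      INTO :member;
  END
END^
SET TERM ; ^

For a 10-member dimension that is 11 small queries, which would beat a
10-million-row scan, but only if each MIN() really is an index probe.)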

Maybe I should go into the problem space a bit. We are taking subsets
of user data from a main server and copying it into a local Firebird
database so the user can do local OLAP. We are trying to make our use
of Firebird as performant as possible. We used to dump the data into
a flat relation, but we have been moving towards data normalisation to
improve performance and space utilisation (all the while keeping an
eye on the literature WRT column store performance vs. relational
performance). We find that with full normalisation and relation
culling, we can get excellent performance on some queries (e.g. the
query above gets an instantaneous response because after pruning the
fact table we are reading four rows from a dimension table) but for
typical OLAP queries (e.g. SELECT dimensions, aggregations FROM fact
table INNER JOIN relevant dimension tables GROUP BY dimensions) we
find that the performance is WORSE (often by a factor of 2), unless
the table is very wide and the subset of the dimensions is small.
This suggests to us that joins are expensive, and only pay off when
normalisation eliminates enough column reads that the disk-read
savings outweigh the join costs.
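
For concreteness, the typical query shape is something like this (all
table and column names here are made up for illustration):

SELECT r."Region", p."Product", SUM(f."Sales") AS "Total Sales"
FROM "Fact" f
INNER JOIN "DimRegion" r ON r."RegionID" = f."RegionID"
INNER JOIN "DimProduct" p ON p."ProductID" = f."ProductID"
GROUP BY r."Region", p."Product"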

Does this sound right?

Incidentally, we are using Firebird because it is small, well tested
and has a very flexible and compliant data model (including
column-level collation). We do understand that it is optimised for
OLTP, but
are trying to make the best of it for now. (Plus we have looked at
embedding MonetDB, and while it looks cool and is designed for OLAP,
it does not seem ready for Prime Time just yet...)