Subject Re: Ideas anyone?
Author Adam
> Problem:
> - The system is experiencing performance issues.
> - A major new release is coming up and not only is the performance
> to be sorted but also new requirements have arisen.
> Current Performance Issues, likely causes:
> - IN() performance. Appears to use Uses multiple indexes(!) even
> when IN() statement is constants.

Hi Tom,

50 constants is nothing :)

This is a common problem when the queries are automatically generated
based on user input of some form. If you were to grab the query text,
you would see that it does silly things,

select ID, Name
from table
where ID in (1,2,3,4,5,6,7,8,9,...,150)

Now of course if you are writing this by hand, you would use.

where ID between 1 and 150, but this is much harder to do in an
automated tool (just been through a similar process).

To save you some time, here are some important things you should know
about the IN predicate.

One index hit per member:

Your query will hit the index on ID for every member of the ID. It
will not internally optimise it to a set of betweens etc, so that is a
penalty. Offsetting this of course, is after the first read, the index
gets cached, but still it is not good.

1499 limit

where ID in (1,2,3,4,5,.....,2000)

will not work

where ID in (1,2,3,4,5,.....,1499)
or ID in (1500,1501,1502,.....,2000)

will work. Now you might be thinking at this point that a relatively
trivial function can be written to split like this, but there is
another limit.

64K query size

From memory this will kick in somewhere between 12000 and 13000
members in the IN. (and remember that is a lot of unnecessary hits on
the index).

This is how we solved this problem. This is Delphi specific, but I am
sure it can be adapted to any other language (I know you are using
Delphi but someone else may not be).

We have a function along the lines

DatasetToConditionClause(ds : TDataset; dsFieldName : String;
dbFieldName : String);

We are using kbmMemtables, but and dataset descendant (ClientDataset)
could be used.

The first step is to sort and remove duplicates from the dataset.
Using indexed memtables, this can be done very efficiently.

There are two other memory tables created internally in the process.
memSequential and memRandom.

1. if memSequential is empty, the record is removed from the sorted
dataset and inserted into memSequential. A variable holds this number.

2. The next record is checked, and if that record is exactly one
greater than the record in the variable in 1, then it is appended to
the sequential table, and the variable is updated. Otherwise the "run"
has been broken. In this case, the record count of memSequential is
taken, and if it is > 1, then we class that as a significant run, and
build the string

"([dbFieldName]>=[MinValueFromSequential] AND

This is equivalent to a between, but takes up less space. This string
is added to a stringlist, and memsequential is emptied.

Go back to step 1, and repeat until the sorted dataset is empty.

(Dont panic I will do an example)

Then all you need to do is to convert the sequential stringlist into a
series of or statements. The random memory table now needs attention.
These values can not be optimised in any way, so they need to be
converted to an IN predicate. Now of course we remember that there can
only be 1499 records at most, so we make a new IN block for every 1499

So lets see it in action.

Imagine your dataset full of ID values looked like this


This is from experience the most common sort of scenario, you actually
have most values, and the ones that are missing are mainly deleted,
rolled back transactions but the generator was incremented, or
deselected by the user.

The above function would give you the following

(ID>=1 AND ID<=30) OR
(ID>=34 AND ID<=35) OR
(ID>=50 AND ID<=560) OR
(ID>=564 AND ID<=2000) OR
(ID IN (38))

Note that this is only 5 hits to the index, and will keep you away
from the 64K limit for as much as possible. This string is just
substituted into the query before it is executed. While we have not
observed a noticable performance increase over the old function (that
just put everything inside an IN block), it has allowed our software
to handle larger databases because of the 1499 limit.

We have this function in a common unit, and it has dUnit test cases
built for it to prove it works.

The other options I have seen and attempted is to insert the records
into a holding table with the current transactions ID, then to join to
that table. The downside to that approach is the number of inserts per
second you can practically achieve.