Subject Re: Some benchmarks about 'Order by' - temporary
Author m_theologos
--- In Firebird-Architect@yahoogroups.com, "Vlad Horsun" <hvlad@...>
wrote:
>
> > > Yes, having all the select list's fields in sorted order is a
> > > very good thing, as it avoids the need to reread them from the
> > > tables again. This is a strong point, no doubt.
> > >
> > > What I mean is that the sorted runs contain whole records, and
> > > during the merge phase we must read them from disk and write the
> > > merged, larger runs back. Formally we need only the keys to
> > > compare and merge.
> >
> > Yes, but the whole requested record should be returned to the
> > caller.
> >
> > If they are not written after the keys, how will you locate them
> > quickly on disk?
>
> I don't have a good answer.
>
> The simplest way is to place the records in an additional file and
> add offsets into that file to the sort records, which would then
> contain only the keys and these offsets. We'd get a sort time
> independent of record size, but more space would be needed and
> record retrieval in sorted order could be slower.
>
> Regards,
> Vlad
>

(Generally speaking; this is not aimed only at Vlad)

If you choose to sort only the keys and the offsets (let's say, to
keep things simple, that the RDB$DB_KEY is the record tag), we can do
some calculations to see how much memory this structure takes:

RecCount     Key size (bytes)   RDB$DB_KEY (bytes)   Mem taken (MB)
1.000.000          100                   8                103,00
1.000.000          200                  16                205,99
1.000.000          200                  32                221,25
2.000.000          200                  24                427,25
3.000.000          150                  24                497,82
4.000.000          150                  16                633,24
5.000.000          150                   8                753,40
5.000.000          150                  24                829,70

where the memory taken was calculated as RecCount * (Key size +
RDB$DB_KEY size) / 1024 / 1024, correct me if I'm wrong.
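
Just to make the arithmetic easy to check, here is a small,
stand-alone C++ sketch (my own throwaway code, nothing taken from the
engine; the rows are simply a few lines of the table above) that
applies that formula:

#include <cstdio>

int main()
{
    // Each row: record count, sort key size in bytes, record tag (RDB$DB_KEY) size in bytes.
    struct Row { long long recCount; int keySize; int tagSize; };
    const Row rows[] = {
        { 1000000, 100,  8 },
        { 1000000, 200, 16 },
        { 5000000, 150, 24 },
    };

    for (const Row& r : rows)
    {
        // RecCount * (Key size + RDB$DB_KEY size) / 1024 / 1024
        const double mb = double(r.recCount) * (r.keySize + r.tagSize) / 1024.0 / 1024.0;
        std::printf("%10lld %6d %4d %10.2f\n", r.recCount, r.keySize, r.tagSize, mb);
    }
    return 0;
}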

Some observations:

1. For _very_big_ tables with very large sort keys (for me a sort key
of 200 bytes is large, even if someone uses UNICODE_FSS, which eats 3
bytes/char), IMHO we are in a very good situation memory-wise, given
that today each server should have (correct me if I'm wrong) at least
2 GB of RAM.
2. Because the RDB$Key size varies far less than the key size and/or
the record count (AFAIK, the RDB$Key changes its size only on joins),
it has a very small influence on the memory consumption.
3. A _very_big_ table (above 2.000.000 records) also has other
problems and, generally speaking, in most cases it should be
redesigned (partitioned - partitioning? When will we have partitioning
in Firebird? IMHO, a cool feature).
4. Of course, an easy and big improvement would be to work not with,
let's say, the full VARCHAR(100), but with the actual length of the
string, which reduces the amount of memory needed a lot (see the
small sketch after this list).
5. Tomorrow we'll have (I hope) a 64-bit Firebird, which will break
the 2 GB memory-per-process limit.
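
About point 4, a tiny hypothetical C++ sketch (made-up layout and
names, not the real sort code) that only compares the bytes taken by
padded VARCHAR(100) key slots against keys stored with their actual
length plus a 2-byte length prefix, both followed by the 8-byte tag:

#include <cstdio>
#include <string>
#include <vector>

int main()
{
    const std::vector<std::string> keys = { "Smith", "Theologos", "Horsun" };
    const std::size_t declaredWidth = 100;   // VARCHAR(100): key slot always padded to 100 bytes
    const std::size_t tagSize = 8;           // RDB$DB_KEY used as the record tag

    std::size_t padded = 0, compact = 0;
    for (const std::string& k : keys)
    {
        padded  += declaredWidth + tagSize;   // fixed-width slot per sort record
        compact += 2 + k.size() + tagSize;    // 2-byte length prefix + actual characters
    }
    std::printf("padded: %zu bytes, actual length: %zu bytes\n", padded, compact);
    return 0;
}

With these sample values the difference is roughly a factor of six,
which would scale the tables above down accordingly.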


So, another table:

RecCount     Key size (bytes)   RDB$DB_KEY (bytes)   Mem taken (MB)
1.000.000          200                   8                198,36
1.000.000          300                   8                293,73
1.000.000          400                   8                389,10
1.000.000          500                   8                484,47
1.000.000          600                   8                579,83
1.000.000          700                   8                675,20
1.000.000          800                   8                770,57
1.500.000          200                   8                297,55
1.500.000          300                   8                440,60
1.500.000          400                   8                583,65
1.500.000          500                   8                726,70
1.500.000          600                   8                869,75
1.500.000          700                   8              1.012,80
1.500.000          800                   8              1.155,85
2.000.000          200                   8                396,73
2.000.000          300                   8                587,46
2.000.000          400                   8                778,20
2.000.000          500                   8                968,93
2.000.000          600                   8              1.159,67
2.000.000          700                   8              1.350,40
2.000.000          800                   8              1.541,14

What am I trying to say with these numbers?

IMHO, if you make this small change in the sorting engine, you'll
keep all the data in memory, so it isn't necessary to worry so much
about the case in which memory is exhausted and what to do then.
Perhaps it is better to improve the in-memory sorting rather than put
all our effort into building a very good SORT_read_block and
SORT_write_block. Anyway, that will be a rather marginal case.
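
To show what I mean by that small change, here is a rough C++ sketch
of a key-plus-tag sort record (an assumed layout, not the real SORT_*
structures of the engine): only the key and an RDB$DB_KEY-like tag are
sorted in memory, and the full rows are fetched afterwards in key
order:

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

// Assumed sort record: collated key plus a tag to re-fetch the full row later.
struct SortRecord
{
    char          key[32];   // sort key (fixed width here only to keep the sketch short)
    std::uint64_t dbKey;     // RDB$DB_KEY-like record tag
};

int main()
{
    std::vector<SortRecord> run;

    SortRecord a = {}; std::strcpy(a.key, "delta"); a.dbKey = 42; run.push_back(a);
    SortRecord b = {}; std::strcpy(b.key, "alpha"); b.dbKey = 17; run.push_back(b);

    // Order by key only; the full records never enter the sort buffer.
    std::sort(run.begin(), run.end(),
              [](const SortRecord& l, const SortRecord& r)
              { return std::strcmp(l.key, r.key) < 0; });

    // Walk the sorted tags; in the engine this is where each row would be
    // re-read from the data pages (or from the additional file Vlad mentions).
    for (const SortRecord& sr : run)
        (void)sr.dbKey;   // placeholder for the actual fetch by tag

    return 0;
}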

HTH,

m. Th.