| Subject | Re: Size of an Interbase table |
|---|---|
| Author | Ann W. Harrison |
| Post date | 2001-12-31T16:03:22Z |
This slightly garbled message describes some of the problems
involved in computing the size of an InterBase database. It
ignores the issues of back versions, fragmented records, and
large blobs.
The message is my response to someone asking for clarification
of an earlier, tongue-in-cheek message I had sent:
>I'm trying to use your equation:
> ((TC * TS * RC) + (IC * (AK + PT) * RC)) * K
>but is it possible to use it to calculate the octets size of each table of
>my database ?

Probably not. And the question is more complicated than you may think.
Even the easy part is hard.
What are you asking for? Do you want to know how much space the
actual data requires? Or how much the data plus its overhead
(the record header and the record index) takes? Or the actual number of
data pages used, including reserved space and page overhead? Or
the total number of pages required to represent the data, including all
the above plus pointer pages, space management pages, and entries in
the rdb$pages table (plus, of course, their overhead)? Or all that plus
the index space?
Having decided the answer to that, figuring out just the first bit -
the length in bytes of the actual data - is really hard.
The easiest way to calculate the size of an existing table
is with gstat or the database analysis available through the server
manager. (They're actually the same thing, I think.) From the server
manager tasks menu, invoke "database statistics". In the new window,
from the view menu, invoke "database analysis."
That will give you a printout like this:
T1 (161)
Primary pointer page: 388, Index root page: 389
Data pages: 2523, data page slots: 2523, average fill: 75%
Fill distribution:
0 - 19% = 5
20 - 39% = 1
40 - 59% = 10
60 - 79% = 2302
80 - 99% = 205
The "Data pages" value is the number of pages that hold some data.
Effectively, that value times the page size tells you how much space
the data uses in the database, since the free space on a page can
be used only for other rows of the same table. Multiplying by the
fill factor will give an estimate of the actual data size, and the
amount of space the data will use if you turn off the "reserved space"
allowance and backup and restore the database.
Computing the amount of reserved space is tricky, and my memory is
slightly feeble. I think that eighteen or twenty-four bytes is reserved
for each row stored on the page.
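
To make that concrete, here is a minimal Python sketch of the arithmetic,
using the T1 figures above and assuming a 1024-byte page size (the page size
is not printed in this part of the report, and the function names are mine,
not anything InterBase provides):

```python
# Rough data-size estimates from the gstat figures for T1 shown above.
# The 1024-byte page size is an assumption for illustration.

def table_space_on_disk(data_pages: int, page_size: int) -> int:
    """Bytes occupied by the table's data pages in the database file."""
    return data_pages * page_size

def approx_data_size(data_pages: int, page_size: int, avg_fill: float) -> int:
    """Estimate of the bytes of actual row data (pages scaled by fill)."""
    return int(data_pages * page_size * avg_fill)

print(table_space_on_disk(2523, 1024))     # 2583552 bytes of data pages
print(approx_data_size(2523, 1024, 0.75))  # 1937664 bytes of row data, roughly
```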
If you're interested in all the space used by a table, you'll have to
check the indexes as well. The same analysis will give you data like
this about each index:
Index T1_IDX1 (3)
Depth: 4, leaf buckets: 1404, nodes: 20000
Average data length: 56.00, total dup: 0, max dup: 0
Fill distribution:
0 - 19% = 0
20 - 39% = 1
40 - 59% = 0
60 - 79% = 3
80 - 99% = 1400
This is a bit trickier, because the statistics apply only to the bottom
level of the index. To compute the bottom level size, take the
number of leaf buckets (a.k.a. bottom-level index pages) and multiply by
the page size. To get the total index size, you'll have to estimate the
sizes of the intermediate levels of the index. The top level is always
one page.
The number of pages in the level up from the leaf level is roughly
number of leaf-level pages / nodes per page
where nodes per page is
(PageSize / (average data length + 6)) * the average fill fraction
e.g. 1404 / ((1024 / (56 + 6)) * .90), which rounds up to 95 (a fraction
becomes a whole page).
For the next level up, use the value computed for the previous level in
place of the number of leaf-level pages:
e.g. 95 / ((1024 / (56 + 6)) * .90), which rounds up to 7; the level above
that is the single top page, for a depth of 4, matching the gstat report.
(The database I used really ought to have a bigger page size to reduce
the depth of the indexes.)
The six (6) is my best recollection of the index entry overhead above the
leaf level.
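
Here is a small Python sketch of that level-by-level estimate. The function
name and parameter defaults are mine; the six-byte entry overhead and the 90%
fill are the guesses above, not exact InterBase internals:

```python
import math

def estimate_index_pages(leaf_buckets: int, avg_data_length: float,
                         page_size: int, fill: float = 0.90,
                         entry_overhead: int = 6) -> int:
    """Estimate total index pages: the leaf level plus each upper level,
    sized by the rule of thumb above (average data length plus about six
    bytes per entry, at roughly 90% fill)."""
    total = leaf_buckets
    pages = leaf_buckets
    while pages > 1:
        nodes_per_page = (page_size / (avg_data_length + entry_overhead)) * fill
        pages = math.ceil(pages / nodes_per_page)  # a fraction becomes a whole page
        total += pages
    return total

# For T1_IDX1: 1404 leaf buckets, 56-byte average data length, 1024-byte pages.
# The levels come out to roughly 1404 -> 95 -> 7 -> 1, matching the depth of 4.
pages = estimate_index_pages(1404, 56.0, 1024)
print(pages, pages * 1024)   # 1507 pages, about 1.5 MB
```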
If you want to compute the amount of space data will take, that's much
harder. The formula I gave was sort of a joke - neither clever nor
clear. It's obscure because the result is just about unpredictable.
Let me try to answer your questions about the terms, then try to
explain why the question is hard.
>TC - table count
>    Nb of records ?

No, number of tables.

>TS - table size (average bytes per row)
>    How to obtain it ?

Not easily. You can determine the maximum size of a row this way:
Compute the length of each column.
char (n) = n bytes
varchar (n) = n + 2 bytes
date = 8
blob = 8
short = 2
integer = 4
float = 4
double precision = 8
numeric (n, m) = 2 if n < 5
else 4 if n < 10
else 8
decimal (n, m) same as numeric.
Now add the pad bytes for alignment. Char fields are not aligned.
Varchars are aligned on even boundaries. All others are aligned on
boundaries divisible by 4. The row always starts on a boundary
divisible by 4.
Now add the extra bytes for compression. In the maximum sized row,
every 128th byte is a compression byte, starting with the first byte.
Add 16 bytes for the row overhead, plus two bytes for the row index.
Now, figure in the compression. Any time there is a run of three or
more identical byte values, there's a compression byte and a data byte.
A run of up to 127 identical bytes will compress to two bytes. The
compression goes across columns, so if you have a string of 5 character
fields each 25 bytes long and all empty, they'd be represented by
two bytes.
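
To make that arithmetic concrete, here is a rough Python sketch of the
maximum-row-size and run-compression rules just described. The per-type sizes
and the 16 + 2 bytes of overhead come from the description above; the function
names, the column tuples, and the example columns are my own inventions, not
anything InterBase exposes:

```python
import math

def column_bytes(col_type, n=0, scale=0):
    """Return (size, alignment) for one column; n is the char/varchar
    length or the numeric/decimal precision."""
    if col_type == "char":
        return n, 1                      # char(n): n bytes, not aligned
    if col_type == "varchar":
        return n + 2, 2                  # varchar(n): n + 2, even boundary
    if col_type == "short":
        return 2, 4
    if col_type in ("integer", "float"):
        return 4, 4
    if col_type in ("date", "blob", "double precision"):
        return 8, 4
    if col_type in ("numeric", "decimal"):
        return (2 if n < 5 else 4 if n < 10 else 8), 4
    raise ValueError(f"unknown type {col_type!r}")

def max_row_size(columns):
    """Maximum stored size of one row: column bytes plus alignment padding,
    worst-case compression bytes, the 16-byte record header, and the
    2-byte record index entry."""
    offset = 0                                       # the row itself starts 4-aligned
    for col_type, *args in columns:
        size, align = column_bytes(col_type, *args)
        offset = math.ceil(offset / align) * align   # pad to the boundary
        offset += size
    compression_bytes = math.ceil(offset / 127)      # ~1 control byte per 127 data bytes
    return offset + compression_bytes + 16 + 2

def compressed_length(data: bytes) -> int:
    """Very rough length under the run scheme described above: a run of
    3..127 identical bytes stores as two bytes; everything else is stored
    literally, with one control byte per (up to) 127 literal bytes."""
    total = literal = 0
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 127:
            run += 1
        if run >= 3:
            if literal:
                total += literal + math.ceil(literal / 127)
                literal = 0
            total += 2                   # control byte + repeated value
        else:
            literal += run
        i += run
    if literal:
        total += literal + math.ceil(literal / 127)
    return total

# e.g. an integer key, a char(25), and a varchar(80):
print(max_row_size([("integer",), ("char", 25), ("varchar", 80)]))   # 131

# Five empty char(25) columns are 125 blanks in a row, which the run
# compression squeezes down to two bytes, as described above:
print(compressed_length(b" " * 125))                                 # 2
```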
>RC - row count (average rows per table)

I was giving a very very rough formula for computing database
size. Obviously, it would be better to compute the size of
each table.

>IC - number of indexes

Finally, something simple.

>AK - average key length
>    Gave by GSTAT ?

Yes.

>PT - pointer portion of key
>    Gave by GSTAT ?

That value happens to be four. It's just a built-in value.

>RC - average row count
>    Gave by GSTAT ?

Yes. Or from your own knowledge if you're computing the
expected size of data.

>K - the magic number.
>    ??

Sigh. That was the punch line of the joke. K is the magic constant
for your database that makes the computation above produce the database
size. It's computed by dividing the previously calculated number
into the size of the database.
You had asked about computing the table size from Delphi. A rough
approximation could be done by finding the maximum row size and multiplying
it by the number of rows. As for indexes, I can't even begin to
suggest a way of estimating because index keys are subject to prefix
and suffix compression (except for the first key in each bucket).
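
A minimal sketch of that approximation, using the 131-byte maximum row from
the earlier sketch and a made-up row count (which your application would get
from something like SELECT COUNT(*)):

```python
# Rough upper bound for one table: worst-case row size times row count.
max_row_bytes = 131     # from the max_row_size() sketch above
row_count = 20000       # example figure, e.g. from SELECT COUNT(*)
print(max_row_bytes * row_count)   # 2620000 bytes, an upper bound
```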
Gstat will occasionally produce an index description like this:
Index C4_T1 (1)
Depth: 3, leaf buckets: 128, nodes: 20000
Average data length: 0.00, total dup: 16976, max dup: 15
Fill distribution:
0 - 19% = 1
20 - 39% = 0
40 - 59% = 0
60 - 79% = 0
80 - 99% = 127
Notice that the average data length is zero. That's the effect of
prefix compression on indexes with lots of duplicates. The actual
key is 25 bytes long.
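
As a toy illustration of why that happens (this shows only the general idea of
prefix compression; it ignores suffix compression and InterBase's real page
layout, and all the names are mine):

```python
def stored_lengths(sorted_keys):
    """Bytes of key data stored per key after prefix (front) compression:
    each key keeps only the bytes that differ from the previous key."""
    lengths, prev = [], b""
    for key in sorted_keys:
        common = 0
        while common < min(len(prev), len(key)) and prev[common] == key[common]:
            common += 1
        lengths.append(len(key) - common)
        prev = key
    return lengths

dups = [b"A" * 25] * 16                        # sixteen identical 25-byte keys
print(stored_lengths(dups)[:4])                # [25, 0, 0, 0]
print(sum(stored_lengths(dups)) / len(dups))   # 1.5625, heading toward 0.00
```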
Frankly, gstat is the best source of information about the size of existing tables.
Estimating the eventual size of tables is a very hard problem, since
it depends on the compression factor for the actual data. That's one
reason why InterBase has never required preallocating table or index
space.
Sorry not to be more help -- if there's something specific you're
trying to do, I may be able to come up with better formulas based
on more data.
Ann