Subject: Re[2]: [Firebird-Architect] blobs causes table fragmentation. TEST+LONG
Author: Dmitry Kuzmenko
Post date: 2004-10-05T07:28:11Z
Hello, Sean!
Tuesday, October 5, 2004, 3:42:07 AM, you wrote:
LS> So let's keep our eye on the ball, and understand where the problems
LS> truly exist.
Ok, I did that test. Full test description follows (LONG).
I used FB 1.5.1 SS on Windows and created a database with an 8K page size.
(The final database size will be about 1.8 GB, so be prepared for that.)
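(For reference, creating such a database from isql looks roughly like this;
the file name and credentials are placeholders, not the ones actually used:)
CREATE DATABASE 'C:\data\blobtest.fdb'
  USER 'SYSDBA' PASSWORD 'masterkey'
  PAGE_SIZE 8192;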
First I created one table:
CREATE TABLE A (ID INT NOT NULL, NAME VARCHAR(30));
and three tables B, C, and D with the same structure:
CREATE TABLE B (ID INT NOT NULL, NAME VARCHAR(30), BLB BLOB);
Next I added two functions: f_StrBlob (FreeUdfLib) and RandomStrE (my
randomudf).
The data was filled with a procedure.
(There was a problem here: I couldn't fill all the data with one call,
so the solution was to insert in 10K-record chunks and run the procedure 10 times.)
(You may not be able to run it exactly as shown either, because in reality there was
one procedure to fill A, B, C and another one to fill D.)
CREATE PROCEDURE FILLDATA
AS
  declare variable i int;
  declare variable j int;
  declare variable s varchar(30);
  declare variable r varchar(8192);
  declare variable d varchar(8192);
  declare variable e varchar(10000);
begin
  /* generator B and the UDFs RandomStrE / StrBlob must already exist */
  i = 0;
  while (:i < 10000) do
  begin
    j = gen_id(b, 1);
    s = RandomStrE(20, 30);      /* random name, 20-30 chars */
    r = RandomStrE(128, 1024);   /* random string, 128-1024 chars */
    d = RandomStrE(1024, 1024);  /* exactly 1024 chars */
    e = RandomStrE(9000, 9000);  /* exactly 9000 chars */
    insert into a values (:j, :s);
    insert into b values (:j, :s, StrBlob(:r));
    insert into c values (:j, :s, StrBlob(:d));
    insert into d values (:j, :s, StrBlob(:e));
    i = :i + 1;
  end
end
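The filling then amounted to running the procedure ten times, roughly like
this (the commit between runs is my reconstruction of how the 10K chunks
were committed):
EXECUTE PROCEDURE FILLDATA;
COMMIT;
/* repeated 10 times in total */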
As a result (after the 10 calls), all tables were filled with 100K records.
Now:
1. Table A is populated without blobs.
2. Table B is populated with blobs of varying size, from 128 to 1024 bytes.
3. Table C is populated with 1K blobs, to test whether they are placed
on the data page with the record or not.
4. Table D is populated with 9K blobs that won't fit on an 8K database page.
Then I got the statistics:
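(The per-table figures below are gstat output; the exact command line isn't
quoted in this post, but it would have been along these lines, with the file
name as a placeholder:)
gstat -r blobtest.fdb
gstat -r prints record-level statistics per table: record and version counts,
data pages, and the fill distribution.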
A (128)
Primary pointer page: 136, Index root page: 137
Average record length: 37.41, total records: 100000
Average version length: 0.00, total versions: 0, max versions: 0
Data pages: 957, data page slots: 957, average fill: 70%
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 0
60 - 79% = 957
80 - 99% = 0
B (129)
Primary pointer page: 139, Index root page: 140
Average record length: 46.20, total records: 100000
Average version length: 0.00, total versions: 0, max versions: 0
Data pages: 8668, data page slots: 8668, average fill: 95%
Fill distribution:
0 - 19% = 0
20 - 39% = 1
40 - 59% = 0
60 - 79% = 0
80 - 99% = 8667
C (130)
Primary pointer page: 144, Index root page: 145
Average record length: 46.19, total records: 100000
Average version length: 0.00, total versions: 0, max versions: 0
Data pages: 14286, data page slots: 14286, average fill: 96%
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 0
60 - 79% = 1
80 - 99% = 14285
D (131)
Primary pointer page: 24392, Index root page: 24393
Average record length: 46.20, total records: 100000
Average version length: 0.00, total versions: 0, max versions: 0
Data pages: 1555, data page slots: 1555, average fill: 81%
Fill distribution:
0 - 19% = 0
20 - 39% = 0
40 - 59% = 1
60 - 79% = 0
80 - 99% = 1554
What we got here:
1. 1K blobs ARE placed on the data page if they fit. Yes, they fit.
2. 9K blobs are NOT placed on the data page, since they don't fit.
3. Table statistics show exactly the DATA pages used, not blob pages.
4. gstat reports the record length excluding any blob placed on the data page,
so a user may wonder "what really takes up so many data pages if
my record is so small?".
(An interesting point: how will the optimizer calculate cardinality
for A, B, C and D? As far as I know it looks at the table's pointer pages,
so the estimated record counts will be far from reality for tables B and C.)
Let's test queries. Since select count was rejected as a defective test :-),
I used
select id, name from table
with the FETCH ALL option in IBExpert. So I will check how long
this query runs and how many page reads it makes
(though at this point you can already guess that it will match
exactly what the statistics say).
! Each query was run 4-5 times to even out the Firebird cache,
the OS cache, and other flukes.
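For reference, similar counters can be watched without IBExpert by turning
statistics on in isql; a minimal sketch (the whole result set scrolls by
before the counters are printed):
SET STATS ON;
SELECT ID, NAME FROM B;
/* after the last row isql prints elapsed time, reads, writes and fetches */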
select id, name from a
Execute time = 1s 763ms
Avg fetch time = 0,02 ms
Reads from disk to cache = 0
Fetches from cache = 201 971
select id, name from b
Execute time = 2s 073ms
Avg fetch time = 0,02 ms
Reads from disk to cache = 8 705
Fetches from cache = 217 409
select id, name from c
Execute time = 2s 243ms
Avg fetch time = 0,02 ms
Reads from disk to cache = 14 326
Fetches from cache = 228 648
select id, name from d
Execute time = 1s 832ms
Avg fetch time = 0,02 ms
Reads from disk to cache = 0
Fetches from cache = 203 179
Yep. Selects on A and D are nearly identical in execute time and fetches
from cache. The other queries make more reads from disk and more fetches
from cache, and run longer.
I think the question "is it useful to store blobs outside the record"
is now closed? You may say that 400-500 ms is not much of a difference.
True, for this dummy query. Other queries (joins, etc.) will show a
greater time difference; see the sketch below.
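For instance, a hypothetical join like this one (not measured in this test)
has to read the inflated data page chains of both tables:
select b.id, c.name
from b join c on c.id = b.id;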
Have I proven my suggestion or not? Sean, I'm not relying on intuition; this
is just practical experience. And I really didn't expect
that this simple suggestion would raise such a discussion.
--
Dmitri Kouzmenko