Subject RE: [Firebird-Architect] blobs causes table fragmentation.
Author Leyne, Sean
Roman,

> Not each access to the record means access to the blob. If in order to
> process a request engine needs to evaluate condition on the row data
> (for example no index can be used for it), the bigger record is, more
> pages you have to read.

But that (bigger record = more pages read) is true regardless of where
the data is stored -- this is why the engine supports large page sizes.


> If you have a lot of relatively small BLOBs, each SELECT that requires
> access to the record would need to do twice as more disk IO as
> compared to the situation where BLOB is stored in separate table with
> an integer primary key.
>
> Current behavior is even more flawed when we consider the API. When we
> do want to improve the performance for small BLOBs that are already
> loaded into memory, why do we require additional network call to
> access BLOB's content? Why don't we send its content right away to the
> client? Currently we fetch records that contain BLOB ids, and then for
> each ID we have to load BLOB segments (in this case that is usually
> one segment)...
>
> Also we believe that access to the BLOB content would happen faster
> because the page is in cache. In Classic with default cache size of 75
> pages that means only cache hit when SELECT fetches less then ~600
> rows described above. For bigger result sets (I assume we have cache
> with LRU policy for replacing pages in it) cache does not play any
> role, because when we finish fetching data, first page is thrown away
> when the record 601 is fetched. Then we start accessing BLOBs starting
> the first row, load page 1 by throwing away page 2 (at that point its
> the oldest one), then we continue to page 2 by throwing away page 3,
> and so on. I suspect that the situation with SuperServer is more or
> less similar under heavy load. Anyway effect of cache hit can be
> completely neglected because of network call latency of 5-10 ms per
> call, you can't fix this (and I suspect that additional disk IO gives
> comparable delays). So we loose on each SELECT that involves access to
> the record regardless whether BLOB is needed or not, but what do we
> actually earn on small BLOBs?

I agree that the current situation is less than ideal.

But I need to point out: The problem you describe is with the API/wire
protocol! Not with the manner in which the data is stored in the
database!!

If a new wire-protocol allowed for the return of Blobs with the row data
(perhaps to a defined size) then there would be no problem with the
current storage approach.

So let's keep our eye on the ball, and understand where the problems
truly exist.


> Probably I would introduce new data type BINARY(n) of maximum size
> 32000 bytes content of which would be stored on the database page
> (that would be VARCHAR type without all character information, like
> charset, collation, etc.) whose content is sent over the wire together
> with the content of the rest of the fields. People that need objects
> without upper bound should use BLOBs that in this case would be always
> stored separately.

I don't have a problem with a new datatype, but I'm not so sure this
will solve anything -- in fact depending on the page size and the BINARY
size I could see a case where disk I/O would *significantly* increase
(degrading performance).

The big negative for a new type is the inevitable user confusion -- what
is the difference between BLOB and BINARY???


Sean