Subject Re: [firebird-support] Varchar size overhead
Author Alexandre Benson Smith
Ivan Prenosil wrote:
>> A bird told me some seconds ago that it is 127 to 2 not 512 to 2 :)
>>
>> I remember that I read that a very long time ago, but I don't know where
>> it's documented; IIRC it was a message in this list.
>>
>
>
> e.g.
> http://www.volny.cz/iprenosil/interbase/ip_ib_strings.htm
>

Well... In a quick search I did not find the message I was after...

But so far I found information like:

1.) 512 -> 2 (from my memory) which is clearly wrong, since one byte
would be used for the quantity and another for the repeated byte; there
is no way to store 512 in one byte :)
2.) 127 -> 2 (from a friend) Don't know if it's right; it could be, but
why just 127 and not 128? (some bits/values reserved for a flag or
something like this?)
3.) 128 -> 2 (Ivan) makes sense, but why not 256? one bit for a flag or
something?
4.) 256 -> 2 (Ann Harrison, message below) makes sense, so a full byte
is used to represent the quantity, but doesn't it need a kind of
flag/special value to indicate that the following value is a repetition
instead of an ordinary value?
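I can't settle which hypothesis is right here, but the arithmetic behind them is easy to check. Below is a toy run-length encoder (just a sketch, not Firebird's actual code): each run of identical bytes costs two bytes (a count byte plus the repeated value), and the maximum run a single count byte can describe - 127, 128, or 256 - is exactly what the hypotheses above disagree about:

```python
# Toy run-length encoder, NOT Firebird's implementation: each run of
# identical bytes is stored as a (count, value) pair, with the run length
# capped at max_run.
def rle_size(data: bytes, max_run: int) -> int:
    """Return the compressed size of `data` in bytes."""
    size = 0
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < max_run:
            run += 1
        size += 2          # one count byte + one repeated-value byte
        i += run
    return size

empty = b"\x00" * 512      # e.g. the padding of an empty varchar(512)
for cap in (127, 128, 256):
    print(cap, rle_size(empty, cap))   # 127 -> 10, 128 -> 8, 256 -> 4
```

So 512 identical bytes cost 10, 8, or 4 compressed bytes depending on whether a count byte can describe 127, 128, or 256 repetitions.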

The part of the below message relevant to this thread:
"The data portion compresses to 2 bytes for every 256. So an empty
varchar(512) compresses to 6 bytes."

Perhaps that "512" was what I mixed up in my memory...


see you !

--
Alexandre Benson Smith
Development
THOR Software e Comercial Ltda
Santo Andre - Sao Paulo - Brazil
www.thorsoftware.com.br



-------- Original Message --------
Subject: Re: [firebird-support] BLOB performance
Date: Fri, 21 Jan 2005 14:39:51 -0500
From: Ann W. Harrison <aharrison@...>
Reply-To: firebird-support@yahoogroups.com
To: firebird-support@yahoogroups.com
References: <csr0v6+atb3@...> <41F12B44.8060109@...>
<017201c4ffd8$798b07f0$b5bddec3@...>



Alexey Kovjazin wrote:

> May be, I am wrong, but I think that BLOB reading and writing becomes
> expensive when BLOBs are small and placed on the same data page with other
> columns, am I right?

There's a difference between something that affects performance and
something that's "expensive". The writer was comparing the cost of
storing blobs in a record with storing them in another table or storing
them in separate files. Clearly (at least to me) storing blob data in
external files is a performance disaster.

The second alternative that the customer suggested is storing blobs in a
separate table. Reading a blob by blob id requires checking a pointer
page and reading a data page. In most cases, both those pages are
already in cache from looking up the record. Even if the data page has
to be read, lookup by blob id is faster than finding a record in a
different table by key value - even primary key - reading that record,
and tracking down the blob in that table.

Neither of the alternatives that the writer suggested are better than
using a blob in a record. You've suggested a third alternative, storing
the blob as a varchar. You want to look at two effects: 1) the cost of
getting the blob data when it exists, and 2) the effect of having the
blobs on page when retrieving record (not blob) data.

Certainly, storing the blob as a varchar eliminates one round trip from
the client when the data exists. However, storing blobs as varchars
does not significantly reduce the record density. In fact, it may
increase the record density.
>
> So, my assumption why small BLOBs are slowing down:
>
> 1) To read BLOBs the engine needs to read the page twice (the second time
> it will be in cache, of course) because different APIs are used to get
> non-BLOB and BLOB columns

As above, reading a page in cache is remarkably quick.

>
> 2) Small BLOBs are not small from the point of view of the data page - when
> BLOBs occupy a big % of the data page, fewer records can fit on the page (the
> situation looks worse when we have large fragmented records and small BLOBs)

The data has to go somewhere. If you store it as a varchar, it takes
the same amount of space as if you store it as a blob, ignoring overhead
and compression. If the blob occupies a large part of a page, putting
the same data in the record will make the record occupy almost exactly
the same space. Blob space packs more densely, so using blobs can
actually increase the number of records on page.

Overhead and compression cut in different ways.

Varchars have two bytes of overhead - the two length bytes. Blobs take
up eight bytes in the record, whether they exist or not, and another 60
bytes or so for the record header, blob header, and page index, if they
do exist.

Both blobs and varchars are compressed as part of the record so the
actual compression depends on what's on either side of them. An empty
varchar consists of two bytes of 0x0, indicating the length, plus n
bytes of spaces. An empty blob consists of eight bytes of 0x0.

Assume the preceding field in the record ended with spaces - reasonably
likely - and the succeeding field is a varchar. The length portion of
the varchar compresses to two bytes: one byte containing the value
-2, indicating two repetitions of the next byte, then a byte of
0x0. The data portion compresses to 2 bytes for every 256. So an empty
varchar(512) compresses to 6 bytes.

The eight bytes of 0x0 that represent an empty blob compress to a length
byte of -8 and a data byte of 0x0 - 2 bytes.

Unless the varchar is preceded by a byte of 0 or followed by a space,
any compressed empty varchar is larger than an empty blob. If the blob
is followed or preceded by a byte of 0, it compresses to nothing.
Varchar compression gets worse as the varchar gets longer - the data
portion of a 32767 byte varchar requires 256 bytes after compression.
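Taking that ratio at face value - 2 compressed bytes for every 256 identical bytes, plus 2 bytes for the compressed length - the sizes quoted here can be reproduced with a little arithmetic (a sketch of this thread's numbers, not a measurement against a real database):

```python
import math

# Sizes implied by the figures in this thread: 2 compressed bytes per run
# of 256 identical bytes, plus 2 bytes for the compressed length portion.
def empty_varchar_compressed(n: int) -> int:
    length_part = 2                      # two 0x0 length bytes -> (-2, 0x0)
    data_part = 2 * math.ceil(n / 256)   # 2 bytes per 256 bytes of padding
    return length_part + data_part

print(empty_varchar_compressed(512))     # 6 bytes, as in the message
print(2 * math.ceil(32767 / 256))        # data portion of varchar(32767): 256 bytes
print(2)                                 # empty blob: (-8, 0x0) -> 2 bytes
```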

So in the case of a value that's rarely present, the blob wins big.

What about medium small blobs that do exist?

Primary records are never stored as fragments, unless the compressed
record is larger than the page size. Assume you've got a record that's
about 1/3 of the page size plus a string that's about 1/4 of a page. If
the string is a varchar in a record, you'll get 1 record per page. If
the string is a blob, you'll get two records plus a blob on some pages,
and one record plus 2 blobs on others - on average 1.5 records per page.
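That 1 vs. 1.5 records per page can be illustrated with rough arithmetic. The 4096-byte page size below is my assumption for illustration (the thread doesn't state one), and header overhead is ignored:

```python
PAGE = 4096                 # hypothetical page size, not from the thread
record = PAGE // 3          # a record about 1/3 of a page
string = PAGE // 4          # a string about 1/4 of a page

# Varchar: the string is part of the record, and a primary record is never
# fragmented, so only whole record+string units fit on a page.
varchar_per_page = PAGE // (record + string)

# Blob: record and blob are separate objects, so they can be packed more
# flexibly; over two pages, three record+blob pairs fit.
blob_per_page = (2 * PAGE) // (record + string) / 2

print(varchar_per_page)     # 1 record per page
print(blob_per_page)        # 1.5 records per page on average
```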

Certainly you get more records per page if each record has less data,
but if you've got to store the data, you're better off storing it as
blobs on page, given the current data architecture.

It might make sense to have separate blob pages and data pages for each
table, but that's a major ODS change.


>
> As a summary, I did not recommend using BLOBs for small data when they may
> fit on a data page - VARCHAR in this case looks better.
>
> Please resolve my doubts - was I wrong?

I think so.


Best regards,


Ann