Subject Re: [Firebird-Architect] A Fresh Look at Collations
Author Jim Starkey
Alexander Peshkoff wrote:
> I also do not see big losses in validation of data, coming from client.
> Networks still seem to be a bit slower compared with CPUs. What is more
> interesting on my mind is how UTF8 strings are planned to be stored in
> database. Will there be compression of text, stored on disk?
>
>
>
Character data doesn't compress at all well with simple schemes. ZLib
compression works well, but is prohibitively expensive -- last time I
checked, it was faster to read extra disk blocks than to decompress
via zlib. There may be better compromises out there, but I'm not aware
of any. Inventing a fast compress/decompress that significantly beats
run-length encoding is a path to database glory.
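
To make the point about simple schemes concrete, here is a minimal sketch of a byte-oriented run-length coder. The control-byte layout and the names rle_encode / rle_decode are assumptions made for illustration; this is not Firebird's on-disk format.

```python
def rle_encode(data: bytes) -> bytes:
    """Hypothetical RLE: control byte 1..127 = that many literal bytes follow;
    control byte 129..254 = repeat the next byte (257 - control) times."""
    out = bytearray()
    i, n = 0, len(data)
    while i < n:
        # Measure the run of identical bytes starting at i (capped at 128).
        run = 1
        while i + run < n and run < 128 and data[i + run] == data[i]:
            run += 1
        if run >= 3:
            out += bytes([257 - run, data[i]])
            i += run
        else:
            # Collect literals until a run of three starts or 127 are gathered.
            start = i
            while i < n and i - start < 127:
                if i + 2 < n and data[i] == data[i + 1] == data[i + 2]:
                    break
                i += 1
            out += bytes([i - start]) + data[start:i]
    return bytes(out)


def rle_decode(blob: bytes) -> bytes:
    """Inverse of rle_encode."""
    out = bytearray()
    i = 0
    while i < len(blob):
        control = blob[i]
        if control < 128:
            out += blob[i + 1:i + 1 + control]
            i += 1 + control
        else:
            out += bytes([blob[i + 1]]) * (257 - control)
            i += 2
    return bytes(out)


prose = b"Character data doesn't compress at all well with simple schemes."
padded = b"Jim".ljust(40)  # a blank-padded fixed-width field
print(len(prose), len(rle_encode(prose)))
print(len(padded), len(rle_encode(padded)))
```

Ordinary prose has no runs of three or more identical bytes, so the output is never smaller than the input, while the blank-padded field collapses to a few bytes. That is the case run-length encoding was built for, and it says nothing useful about text.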

NimbusDB uses a self-describing encoding with overloaded type codes that
I developed for Netfrastructure / Falcon. It doesn't try to compress
text, but does a great job with numbers and mostly eliminates byte
counts for variable-length strings. Measured on a very large customer
database, the self-describing encoding was about 60% of the size of
run-length encoding.

Before we leave utf8, something worth noting is that the utf8 scheme can
encode arbitrary unsigned 32-bit integers as byte strings that compare
naturally with close to optimal packing density. The encoding could be
made denser by using all 8 bits for non-leading bytes at the cost of the
ability to validate correct sequences.
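
A minimal sketch of that idea follows. It assumes the original utf8 bit layout (lead byte with N one-bits, then N-1 continuation bytes of six payload bits each), which reaches 31 bits in six bytes, plus a seven-byte form with lead byte 0xFE to cover the 32nd bit. The seven-byte form is not part of any utf8 standard, and the names encode_u32 / decode_u32 are made up for the example.

```python
def encode_u32(n: int) -> bytes:
    """Encode 0..2**32-1 using the utf8 bit layout, shortest form only."""
    if not 0 <= n <= 0xFFFFFFFF:
        raise ValueError("value out of unsigned 32-bit range")
    if n < 0x80:
        return bytes([n])
    # A sequence of `length` bytes carries 5*length + 1 payload bits:
    # 2 -> 11, 3 -> 16, 4 -> 21, 5 -> 26, 6 -> 31, 7 -> 36.
    for length in range(2, 8):
        if n < (1 << (5 * length + 1)):
            break
    lead_prefix = (0xFF << (8 - length)) & 0xFF      # `length` one-bits, then 0
    out = [lead_prefix | (n >> (6 * (length - 1)))]
    for i in range(length - 2, -1, -1):
        out.append(0x80 | ((n >> (6 * i)) & 0x3F))   # 10xxxxxx continuation
    return bytes(out)


def decode_u32(b: bytes) -> int:
    """Inverse of encode_u32, with the validation the continuation bits allow."""
    lead = b[0]
    if lead < 0x80:
        if len(b) != 1:
            raise ValueError("trailing bytes")
        return lead
    length = 0
    while lead & (0x80 >> length):                   # count leading one-bits
        length += 1
    if not 2 <= length <= 7 or len(b) != length:
        raise ValueError("bad lead byte or wrong length")
    value = lead & (0xFF >> (length + 1))
    for byte in b[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("bad continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


values = [0xFFFFFFFF, 0x80, 0x7F, 0x10000, 0, 0x7FF, 0x800, 0x80000000]
# Sorting by the encoded bytes gives the same order as sorting numerically.
print(sorted(values, key=encode_u32) == sorted(values))
```

Because only the shortest form is emitted and longer sequences have larger lead bytes, a plain bytewise comparison (memcmp) of the encodings orders the integers correctly. The two fixed bits in every continuation byte are the density that would be recovered by the denser variant, and they are also what the validity checks in decode_u32 rely on.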

--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376