Subject Re: [Firebird-Architect] Re: The Wolf on Firebird 3
Author Olivier Mascia
On 02 Nov 2005, at 23:11, Roman Rokytskyy wrote:

>> It's time to accept that we're all part
>> of the same world.
>
> Just that one part of the world accidentally got 1 byte per char, while
> the other needs 2 bytes per char :)

Come on. Every system call your beloved program makes to the Win32
API on Windows 2000, Windows XP, Windows Server 2003, and every newer
version to come for a while spends its time converting your one-byte
chars to two or four bytes per char. Each in-parameter gets widened,
and each out-parameter gets narrowed again on return.
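
To make that round trip concrete, here is a minimal sketch in Python.
It only mimics the conversions; it does not call Win32. The code page
(cp1252) and the helper names widen/narrow are assumptions chosen for
illustration.

```python
# Illustration only: mimics the conversions around a one-byte ("ANSI")
# API call that is implemented on top of a UTF-16 core.
# cp1252 and the helper names are assumptions, not Win32 identifiers.

def widen(ansi_bytes: bytes, codepage: str = "cp1252") -> bytes:
    """Convert one-byte-per-char input to UTF-16LE (in-parameter)."""
    return ansi_bytes.decode(codepage).encode("utf-16-le")

def narrow(wide_bytes: bytes, codepage: str = "cp1252") -> bytes:
    """Convert UTF-16LE back to the one-byte code page (out-parameter)."""
    return wide_bytes.decode("utf-16-le").encode(codepage)

ansi_in = "Olivier".encode("cp1252")  # what the program hands over
wide = widen(ansi_in)                 # what the system works with
ansi_out = narrow(wide)               # what the program gets back

print(len(ansi_in), len(wide), len(ansi_out))  # 7 14 7
```

Two conversions per call, just to end up with the bytes you started
with.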

We can save some bytes by not storing 2 bytes in the DB when one
would really be enough. But that's a fool's game for sure. Everywhere
else in the OS, time and resources are spent widening and narrowing
those strings. On each system call. Think of it. Every SetText to
enter some characters in a GUI field translates your one-byte charset
to a two-byte charset. And the reverse happens when you want to read
what is in the field. The effort to save bytes at the storage level
makes less and less sense as the minutes pass while we discuss it. Of
course, not everybody uses Windows. I use it less and less each day
myself. Yet this all-Unicode choice inside Windows NT was a good
idea. Nobody complains about it, or even thinks of it, on a normal
business day. (I just complain that they chose a kind of utf-16 for
the representation, but that's another story.)

Internally (read: in memory), all string handling in FB might even be
made 32 bits wide per char. That would simplify a lot of things
tremendously. We would only convert to/from utf-8 when storing and
retrieving values, a process possibly combined with some compression
for the storage. A 32-bit value is the most common simple unit of
memory today. Handling bytes forces the compiler to use more machine
language bytes per instruction than handling 32-bit values. Even at
the hardware level, your average Pentium reads 4 bytes from memory at
a time even though you only want one, so masking instructions have to
be added to remove the unwanted portion when needed. Sure, one can
argue the reverse: when you need 4 consecutive bytes, one read gets
them all at once. And handling strings in memory as strings of 32-bit
characters would consume more memory at runtime. Yeah. Though how
many megabytes are there in the cheapest PC you can buy today?
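
A rough sketch of that split, in Python rather than anything FB
actually does: utf-8 at the storage boundary, fixed-width 32-bit
units in memory. The function names (to_memory, to_storage, char_at)
are made up for the example.

```python
import struct

# Hypothetical helpers: utf-8 on disk, 32-bit code units in memory.

def to_memory(stored: bytes) -> bytes:
    """Widen utf-8 storage bytes to 4 bytes per character."""
    return stored.decode("utf-8").encode("utf-32-le")

def to_storage(in_memory: bytes) -> bytes:
    """Narrow the in-memory form back to utf-8 for storage."""
    return in_memory.decode("utf-32-le").encode("utf-8")

def char_at(in_memory: bytes, index: int) -> str:
    """Fixed width: the n-th character sits at byte offset 4 * n."""
    (code_point,) = struct.unpack_from("<I", in_memory, index * 4)
    return chr(code_point)

# "Dvorak" with r-caron (U+0159) and a-acute (U+00E1), 6 characters.
stored = "Dvo\u0159\u00e1k".encode("utf-8")
in_memory = to_memory(stored)

print(len(stored), len(in_memory))     # 8 24
print(char_at(in_memory, 3) == "\u0159")  # True
```

The memory cost is visible (24 bytes against 8), but finding a
character by position becomes a plain offset computation instead of
a scan over variable-length sequences.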

I admit that utf-8 storage obviously takes more bytes per string on
average than a dedicated character set. But whoever needs to cope
with internationalization of their software has to define database
columns that can store several different character sets. My
accounting program needs to be able to store the customer name,
whether this customer name is an English name, a French name, a Czech
name, a Russian name or an Arabic name. Would you want to force me to
have 6 or 7 different columns for the customer name, each declared
with one distinct character set, just for the sake of "saving bytes"
by using one-byte charsets in storage? I say no.
Let me save trees and energy in some other more effective way. ;-)
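
For what it is worth, a small Python sketch of the customer name
case. The sample names and the pairing of each language with an ISO
8859 charset are my own assumptions for the example, not anything
taken from FB.

```python
# Sample customer names; the per-language legacy charsets are assumptions.
NAMES = {
    "English": ("Smith", "ascii"),
    "French": ("Lef\u00e8vre", "iso-8859-1"),       # e-grave
    "Czech": ("Dvo\u0159\u00e1k", "iso-8859-2"),    # r-caron, a-acute
    "Russian": ("Иванов", "iso-8859-5"),
    "Arabic": ("محمد", "iso-8859-6"),
}

def can_encode_all(charset: str) -> bool:
    """True if one column declared with `charset` could hold every name."""
    try:
        for name, _ in NAMES.values():
            name.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Which single charset is enough for one customer-name column?
candidates = [cs for _, cs in NAMES.values()] + ["utf-8"]
usable = [cs for cs in candidates if can_encode_all(cs)]
print(usable)  # ['utf-8']

# The price: bytes per name in utf-8 versus the dedicated charset.
for language, (name, legacy) in NAMES.items():
    print(language, len(name.encode("utf-8")), len(name.encode(legacy)))
```

None of the one-byte charsets can hold all five names, so the real
choice is between one utf-8 column and one column per charset.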

--
Olivier