Subject Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database
Author Olivier Mascia
On 28 Feb 2005, at 11:39, Dimitry Sibiryakov wrote:

> On 28 Feb 2005 at 10:41, Olivier Mascia wrote:
>>> If anyone can name a language that (s)he would like to store in FB
>>> but it is not presented in UCS-2, we can consider using UCS-4.
>> That would be a dumb choice (to opt for UCS-4). 32 bits per character
>> is more than is needed: Unicode needs more than 16 bits, but fewer
>> than 24.
> So, you call Linux designers "dumb"? ;-)

Designers ? ;-)

> That's right, UTF-8 has its advantages. And it can be considered
> as an encoding for network transfer. But using constant-length
> encodings for storage and processing is more convenient, IMHO.

Why then has our 20-year-old Firebird always used a varying record
length concept?
Constant length at the declaration level (saying that this column is 30
characters wide) is important. Constant length at the implementation
level has few consequences, since ultimately our disk storage is also
variable-length based.

> Address arithmetic is faster than a scan, isn't it?

Certainly, but how much string handling requiring address arithmetic
does the engine really need to do?
Except when someone wants a substring of a string, of course?

String handling is done through string classes. Those classes
typically use dynamic memory allocation already. The only assumption
they can no longer make is that byte-length == character-count.
Scanning a UTF-8 string can nonetheless be coded to be very fast. A
strlen() equivalent that counts the character length of a
zero-terminated UTF-8 string is about as fast as one counting the
number of bytes in a plain zero-terminated byte buffer. Thanks to the
way UTF-8 encodes its bytes, continuation bytes are trivially
recognized (they always carry the bit pattern 10xxxxxx in their upper
bits) and can be skipped on sight, and the lead byte of a 3- or 4-byte
character encodes the sequence length in its upper bits, so you can
immediately skip ahead by 2 or 3 bytes when you hit it.
The same property works when you reverse-scan a UTF-8 string: from any
position you can back up over continuation bytes until you reach a
lead byte. These facilities should not be forgotten when considering
UTF-8. There are many differences (simplifications, indeed) in using
UTF-8 compared to using any of the national-specific MBCS encodings.

>> (covers the whole unicode standard), and is the closest to the concept
>> of a stream of bytes terminated by a zero (C-string). The fact that a
>> pure ASCII (7 bits) string in UTF-8 is exactly the same binary as
>> those same bytes in ASCII is also a nice facility.
> But as you mentioned in another letter, ASCII is used by less than
> 50% of people on Earth.

Sure. But I only mentioned ASCII compatibility as a side note, not as
a main point. ;-)
It is not important, just an additional, maybe marginal, facility.

Olivier Mascia