Subject | Re: [Firebird-Architect] Re: [firebird-support] Writing UTF16 to the database |
---|---|
Author | Olivier Mascia |
Post date | 2005-02-28T11:53:38Z |
On 28 Feb 2005, at 11:39, Dimitry Sibiryakov wrote:
> On 28 Feb 2005 at 10:41, Olivier Mascia wrote:
>
>>> If anyone can name a language that (s)he would like to store in FB
>>> but it is not presented in UCS-2, we can consider using UCS-4.
>>
>> That would be a dumb choice (to opt for UCS-4). 32 bits per character
>> is more than needed. Unicode needs more than 16 bits, but less
>> than 24 bits.
>
> So, you call Linux designers "dumb"? ;-)

Designers? ;-)

> That's right, UTF-8 has its advantages. And it can be considered
> as an encoding for network transfer. But using constant-length
> encodings for storage and processing is more convenient, IMHO.

Why has our 20-year-old Firebird always used a varying record length
concept, then?
Constant length at the declaration level (as in "this column is a 30
characters wide column") is important. Constant length at the
implementation level matters little, since ultimately our disk
storage is also variable-length based.
> Address arithmetic is faster than scan, isn't it?

Certainly, but how much string handling requiring address
arithmetic does the engine really need to do?
Except when someone wants a substring of a string, of course.
String handling is done through string classes. Those classes
typically use dynamic memory allocation already. The only thing they
can't assume is byte-length == character-count. Scanning a UTF-8
string can be coded to be very fast, though. A strlen() equivalent meant
to count the character length of a zero-terminated UTF-8 string is about
as fast as one counting the number of bytes in a plain zero-terminated
byte buffer. Thanks to the way UTF-8 encodes the additional bytes, you
can skip them straight away (I mean, on hitting the first byte of a
3- or 4-byte UTF-8 character, you can immediately skip ahead by 2 or 3
bytes, because the character length is encoded in the upper bits of
the first byte of a UTF-8 character).
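
To make the skipping concrete, here is a minimal sketch in C. It assumes
well-formed, zero-terminated UTF-8 input; the names utf8_seq_len and
utf8_strlen are hypothetical and are not Firebird functions.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence introduced by lead byte b.
   Returns 1 for ASCII and for any byte that is not a valid lead byte,
   so that a scan always makes progress. */
size_t utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx: ASCII        */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx: 2-byte char  */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx: 3-byte char  */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx: 4-byte char  */
    return 1;                         /* stray or invalid byte  */
}

/* Count characters (code points) in a zero-terminated UTF-8 string. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    size_t count = 0;

    while (*p != 0) {
        size_t n = utf8_seq_len(*p);
        ++count;
        ++p; /* step over the lead byte */
        /* Skip the additional bytes (10xxxxxx). The check also stops
           at the terminator if the input is truncated. */
        while (--n > 0 && (*p & 0xC0) == 0x80)
            ++p;
    }
    return count;
}
```

For instance, utf8_strlen("h\xC3\xA9" "llo") counts 5 characters in a
6-byte string. Whether this matches a plain byte count in speed depends
on the data and the compiler; the sketch only shows that no decoding is
needed to count characters.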
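The same is true when you reverse-scan a UTF-8 string: the additional
bytes are recognizable by their upper bits, so you can step back to
the first byte of a character. These facilities should not be forgotten
when considering UTF-8. There are many differences (simplifications,
indeed) between using UTF-8 and using any of the nation-specific MBCS.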
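
A reverse scan can be sketched the same way. This is only an
illustration under the assumption of well-formed UTF-8; utf8_prev and
utf8_strlen_reverse are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Given the start of a buffer and a pointer p into it (p > start),
   return a pointer to the first byte of the character that ends just
   before p. Additional bytes all look like 10xxxxxx, so we step back
   until we find a byte that does not. */
const char *utf8_prev(const char *start, const char *p)
{
    assert(start != NULL && p != NULL && p > start);
    do {
        --p;
    } while (p > start && ((unsigned char)*p & 0xC0) == 0x80);
    return p;
}

/* Count characters by walking backward from the end of the buffer. */
size_t utf8_strlen_reverse(const char *s, size_t nbytes)
{
    assert(s != NULL);
    const char *p = s + nbytes;
    size_t count = 0;

    while (p > s) {
        p = utf8_prev(s, p);
        ++count;
    }
    return count;
}
```

With a national MBCS, a trail byte can often be mistaken for a lead byte
when read backward; in UTF-8 the two ranges never overlap, which is what
keeps utf8_prev this short.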
>> (covers the whole unicode standard), and is the closest to the concept
>> of a stream of bytes terminated by a zero (C-string). The fact that a
>> pure ASCII (7 bits) string in UTF-8 is exactly the same binary as
>> those same bytes in ASCII is also a nice facility.
>
> But as you mentioned in another letter, ASCII is used by less than
> 50% of people on Earth.

Sure. But I only mentioned ASCII compatibility as a side note, not as
a main point. ;-)
It is not important, but it is an additional, maybe marginal, facility.
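
As a small illustration of that side note, a pure ASCII string contains
no byte with the high bit set, so it is already valid UTF-8 and its byte
length equals its character count. The helper name is_pure_ascii below is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every byte of the zero-terminated string is 7-bit
   ASCII (high bit clear), 0 otherwise. Such a string is byte-for-byte
   identical in ASCII and in UTF-8. */
int is_pure_ascii(const char *s)
{
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; ++p) {
        if (*p & 0x80)
            return 0; /* part of a multi-byte UTF-8 character */
    }
    return 1;
}
```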
--
Olivier Mascia