Subject | Re: [firebird-support] UTF8 in firebird ?
---|---
Author | Lester Caine
Post date | 2012-01-06T09:57:30Z
Mark Rotteveel wrote:
>>> The longer term solution is for the Firebird project to look at its
>>> data representation and find something that works better with UTF8.
>>
>> yes, at least some options in the database (or in the create statement) to
>> define the size in bytes of 1 UTF8 char
>>
>> For example, by default 1 UTF8 char = 4 bytes (like it is now) and I would
>> like to be able to customize it to be equal to 1 byte.
> Then it is no longer UTF-8. What should change is how Firebird handles and
> stores variable length character encodings like UTF-8, because you are
> right: some parts of Firebird treat it as if it is always 4 bytes, and it
> really should not do that.
>
> However, redefining UTF-8 to fit your specific wishes is not the right way.
This problem has been discussed extensively in the past ... M$ took the
halfway house of just using UTF-16 and seem to ignore the second word when a
character needs it (a surrogate pair). So on Windows we have an even worse
mess when dealing with Unicode.
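
A quick sketch of that surrogate problem, in Python purely for illustration
(the sample string is made up):

    # U+1D11E (musical G clef) lies outside the BMP, so UTF-16 needs a
    # surrogate pair - two 16-bit words - to represent it.
    s = "G\U0001D11E"            # two characters: 'G' and the clef
    utf16 = s.encode("utf-16-le")
    print(len(s))                # 2 code points
    print(len(utf16) // 2)       # 3 UTF-16 words

Any API that treats each 16-bit word as one character reports three
characters here instead of two, which is exactly the "ignore the second
word" mess.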
The STORAGE mechanism needs to be able to handle 24-bit characters, and that
covers the whole of the Unicode universe (code points only need 21 bits), but
offsetting by 3 bytes per character makes things a lot more difficult to
handle. WORKING with Unicode strings by storing them in 32-bit character
arrays means that counting characters and indexing by character are easy, and
when strings are being handled in memory this approach makes a lot of sense.
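
As a minimal sketch of why fixed-width arrays make indexing trivial (again
Python, and the sample text is hypothetical):

    # UTF-32: every character is exactly 4 bytes, so the Nth character
    # starts at byte 4*N - plain offset arithmetic.
    text = "naïve ☃ 𝄞"            # 1-, 2-, 3- and 4-byte UTF-8 characters
    utf32 = text.encode("utf-32-le")
    n = 6
    print(chr(int.from_bytes(utf32[4*n:4*n + 4], "little")))  # '☃'

    # UTF-8: characters are 1-4 bytes, so reaching the Nth character
    # means walking every lead byte before it.
    utf8 = text.encode("utf-8")
    pos = 0
    for _ in range(n):
        b = utf8[pos]
        pos += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    width = (1 if utf8[pos] < 0x80 else
             2 if utf8[pos] < 0xE0 else
             3 if utf8[pos] < 0xF0 else 4)
    print(utf8[pos:pos + width].decode("utf-8"))              # '☃'

The first lookup is constant time, the second is linear in the position -
the trade-off between in-memory convenience and compact storage.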
The problem comes when trying to text search the raw data stored on disk. I
don't think any system has yet found the ideal way of doing that efficiently.
Many systems simply store UTF8 and ignore any question of how many CHARACTERS
are stored ... they just count bytes ... so a VARCHAR(24) can be anything
between 8 and 24 characters long :(
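
To make the VARCHAR(24) point concrete, here is the arithmetic with a
hypothetical byte-counted column limit (nothing Firebird-specific):

    # With a 24-BYTE limit, the number of CHARACTERS that fit depends
    # on the script: 1 byte each for ASCII, 3 bytes each for e.g. kana.
    limit_bytes = 24
    for sample in ("abcdefghijklmnopqrstuvwx", "データベースとは何か"):
        used = chars = 0
        for ch in sample:
            size = len(ch.encode("utf-8"))
            if used + size > limit_bytes:
                break
            used += size
            chars += 1
        print(chars, "characters fit in", limit_bytes, "bytes")

The same declared length holds 24 ASCII characters but only 8 three-byte
ones - the 8-to-24 spread above.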
Until the application level has a bit more standardization on Unicode, I
don't think Firebird is too far adrift from what can be achieved nowadays.
--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php