Subject Re: [firebird-support] Writing UTF16 to the database
Author Brad Pepers
Ann W. Harrison wrote:
> Adriano dos Santos Fernandes wrote:
>
>>WhatsNew of new INTL is here: http://cvs.sourceforge.net/viewcvs.py/firebird/firebird2/doc/WhatsNew?rev=1.45.2.3&only_with_tag=B2_0_intl&view=auto
>>
>>Allow the use of UTF16 in columns isn't a difficult task but is deactivated because isn't complete.
>>Allow using UTF16 as connection charset is difficult and isn't yet started.
>
> Is it necessary to store different character representations in the
> database? Could we not choose some Unicode representation and store
> only that, translating in and out as appropriate?

Doing so would have many benefits, but also at least some hurdles:

1. Depending on the Unicode format used internally (UTF-8/16/32), this
could carry a high penalty in string size. It's not too bad with UTF-8,
but there will be at least a slight overhead compared to a string
stored in its exact encoding (for example, an accented character takes 2
bytes in UTF-8 but only 1 in a single-byte encoding that supports the
accented character directly).

2. Doing this will also likely require a string class that uses Unicode
internally, which would make the class a heavier implementation than
just a simple wrapper around a char*. I suspect that once such a string
class is created, it should be used in all areas of Firebird, rather
than having two separate string classes depending on whether you
currently think you need Unicode support or not, but this would have to
be thought out more.

3. You would need a new system to convert between Unicode and the
supported character sets. It would largely replace the existing
collation sequences and such, which is a large job, I'm sure.

4. Comparison and collation with Unicode is not a trivial problem to
solve. I'm not sure whether there is any code out there that already
does part of this, and it's quite complicated, but there are potential
benefits if it's done right, such as case-insensitive comparisons and
being able to handle any mix of character sets properly.

5. If the key of an index is a string, how is it handled now? I was
under the impression it was stored in a binary format that is expected
to compare properly using a straight binary comparison. If true, I
don't think this will work with Unicode at all, but then I also have
trouble seeing how it would work with other character set encodings,
so perhaps this isn't how indexes are used. How is an index on a
string column stored, and how is it used by Firebird?
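
To put rough numbers on point 1, here is a small Python sketch that
measures the same text under several encodings. The function name
encoded_sizes is made up for illustration, and Latin-1 is only my
stand-in for "an encoding supporting the accented character directly".
None of this is Firebird code.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of `text` per encoding.

    The "-le" variants are used so that no byte order mark is counted.
    Latin-1 is an assumed example of a single-byte legacy character set.
    """
    return {
        "latin-1": len(text.encode("latin-1")),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


if __name__ == "__main__":
    for sample in ("Firebird", "caf\u00e9"):
        print(sample, encoded_sizes(sample))
```

For plain ASCII text, UTF-8 costs the same as Latin-1, while UTF-16
doubles and UTF-32 quadruples the size. The accented word costs one
extra byte in UTF-8 (5 bytes instead of 4).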

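On points 4 and 5, one general technique for making a straight binary
comparison work is to index a transformed "sort key" rather than the raw
encoded string. The sketch below is a toy version of that idea in Python,
under my own assumptions: toy_sort_key is a hypothetical helper, its
rules (ignore case and accents first, then break ties on the original
bytes) are far simpler than real Unicode collation, and I am not
claiming this is how Firebird builds its index keys.

```python
import unicodedata


def toy_sort_key(text: str) -> bytes:
    """Build a byte string whose binary order approximates dictionary order.

    Primary part: the text case-folded with combining accents removed.
    Tie-break part: the original text in UTF-8, after a 0x00 separator.
    This is an illustrative toy, not a real collation algorithm.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.encode("utf-8") + b"\x00" + text.encode("utf-8")


words = ["zebra", "Apple", "\u00e9clair", "apple", "Zebra", "eclair"]

# Binary comparison of the raw UTF-8 bytes: uppercase sorts before
# lowercase, and the accented word lands after everything else.
raw_order = sorted(words, key=lambda w: w.encode("utf-8"))

# Binary comparison of the transformed keys: case and accent variants
# of the same word end up next to each other.
key_order = sorted(words, key=toy_sort_key)

if __name__ == "__main__":
    print(raw_order)
    print(key_order)
```

The comparison itself is still plain byte-by-byte in both cases. Only
the bytes being compared differ, which is why a binary-compared index
and language-aware ordering are not necessarily in conflict.
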
--
Brad Pepers
brad@...