firebird-architect - RE: [Firebird-Architect] UTF-8 Everywhere

Subject	RE: [Firebird-Architect] UTF-8 Everywhere
Author	IBO Support List
Post date	2014-01-17T21:08:14Z

Ann,

You brought out a key distinction that I missed. My apologies for the confusion.

Going with UTF8 uniformly in terms of how things are stored and buffered internally is certainly something I wouldn't discourage, assuming that the external interfaces will all still be supported via transliteration, etc. for legacy API support.

Having a strictly UTF8 core could radically simplify many things. I see it as simply a matter of where you want to push all of the complexity. I'd like to hear more about what all the trade-offs would be.

Jason Wharton

www.ibobjects.com

From: Firebird-Architect@yahoogroups.com [mailto:Firebird-Architect@yahoogroups.com] On Behalf Of Ann Harrison
Sent: Friday, January 17, 2014 1:58 PM
To: Firebird-Architect@yahoogroups.com
Subject: Re: [Firebird-Architect] UTF-8 Everywhere

Jason Wharton <supportlist@...> wrote:

I'm tending to think that adopting a UTF8 only approach is a step backward
...

As for everyone else still dealing with Windows
wide strings, codepages, etc., this simply imposes a potentially major
rewrite of their applications to conform to this new requirement. Legacy
support is always a factor to consider.

I hope that the proposal was for UTF8 internally - storage, sorting, manipulation -
with transformation to and from the declared character set on output and input.
There should be no changes to applications. Having a single internal character
representation simplifies comparisons and, more importantly, greatly reduces
the number of collations - one per desired character sequence rather than one
for each character set that expresses the language.
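As a rough illustration of that boundary-only transformation, here is a minimal Python sketch. It is not Firebird code: `Utf8Store`, `put`, and `get_sorted` are hypothetical names, and plain byte-wise sorting of UTF-8 (which yields code point order) stands in for a real collation.

```python
from typing import List


class Utf8Store:
    """Hypothetical store: UTF-8 inside, declared charset only at the edges."""

    def __init__(self) -> None:
        self._rows: List[bytes] = []

    def put(self, raw: bytes, client_charset: str) -> None:
        # Input boundary: transliterate from the client's declared charset.
        self._rows.append(raw.decode(client_charset).encode("utf-8"))

    def get_sorted(self, client_charset: str) -> List[bytes]:
        # A single comparison rule covers every row, whatever charset the
        # data arrived in. Byte order of UTF-8 equals code point order; a
        # real engine would apply a language-specific collation here.
        ordered = sorted(self._rows)
        # Output boundary: transliterate back to what the client asked for.
        return [row.decode("utf-8").encode(client_charset) for row in ordered]


store = Utf8Store()
store.put(b"\xe9", "latin-1")               # "é" from a Latin-1 client
store.put("z".encode("cp1252"), "cp1252")   # "z" from a Windows-1252 client
store.put("a".encode("utf-16"), "utf-16")   # "a" from a UTF-16 client
result = store.get_sorted("latin-1")
```

The clients never see UTF-8; only the two boundary methods know that more than one character set exists.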

If we want to talk about a step forward in flexibility, I suggest you
consider adding in a universal string where you can have each record
indicate what charset is being stored. This would allow any of the
registered charsets to be stored on a per-record basis.

Arrg! What possible difference does it make to the user how a character
is stored as long as it arrives at the application in the desired format and
order? If your goal is to have a world-wide phone book, UTF8 is the only
way to go. Checking each record (why not each field?) to see how to
interpret its strings will just slow every character operation and introduce
bugs.

Moreover, at the moment, Firebird relies on preallocated record buffers
for transfers from the compressed storage format for comparisons and
other manipulation. If the record format is unknown at request compilation
time, all buffers would need to be allocated at the maximum possible
size.
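The buffer-sizing point can be put in numbers. The sketch below is only an illustration under stated assumptions: the `MAX_BYTES_PER_CHAR` table and the `buffer_bytes` helper are hypothetical, and the per-character widths are illustrative values, not figures read from the Firebird source.

```python
# Illustrative worst-case widths per character; assumed values, not taken
# from Firebird itself.
MAX_BYTES_PER_CHAR = {
    "ISO8859_1": 1,
    "WIN1252": 1,
    "SJIS_0208": 2,
    "UTF8": 4,
}


def buffer_bytes(n_chars, charset=None):
    """Bytes to preallocate for a string field of n_chars characters.

    If the charset is known when the request is compiled, size the buffer
    for that charset. If it is only known per record (charset=None), the
    buffer must fit the widest registered charset.
    """
    if charset is not None:
        return n_chars * MAX_BYTES_PER_CHAR[charset]
    return n_chars * max(MAX_BYTES_PER_CHAR.values())


known = buffer_bytes(100, "ISO8859_1")   # charset fixed at compile time
unknown = buffer_bytes(100)              # charset decided per record
```

Under these assumed widths, a 100-character field declared as ISO8859_1 needs 100 bytes, while the same field with a per-record charset has to reserve 400.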

Best regards,

Ann