firebird-java - Re: Character Set question (practical & philosophical)

Subject	Re: Character Set question (practical & philosophical)
Author	John Craig
Post date	2005-01-21T22:31:22Z

> The problem is that Java provides relatively easy way to handle different
> encodings - every string in Java is Unicode. This is not the case for
> other platforms. If server would tell the client which charset to use,
> that would significantly complicate life of the application developers.
>

OK, but that's only if the server DICTATES to the client what charset
to use. What I'm proposing is that if the client wants to, it can
NEGOTIATE with the server to use the server's native charset. And,
besides, the question's a Java issue as far as I'm concerned (because
JayBird's a Java driver): the behavior of Jaybird/Firebird does not
necessarily have to be identical to the behavior with other drivers.

> Also I have problem convincing people that this feature is needed, as I am
> not convinced that it is needed. You're the first who presented such
> requirement.
>

Here's the reason I need it (leaving aside the UNICODE_FSS option for
the moment): if you need to ensure round-trip conversions, you can't
switch charsets. For instance, Win1252 has characters which don't
exist in DOS850 and vice-versa. So, if you do a conversion from one
1-byte charset to another, you may not be able to get back what you
originally had. So, if the DB is a known 1-byte charset, you can
reliably convert from that DB charset to Unicode inside Java and you
can also convert back--guaranteed. That's the simplest, most elegant
way to handle the issue.
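
To make the round-trip point concrete, here is a minimal sketch in plain Java (no driver involved). The class and method names are made up for illustration, and it assumes the JDK's "windows-1252" and "IBM850" charsets stand in for Firebird's WIN1252 and DOS850. It only covers bytes that the charset actually defines; bytes a charset leaves undefined may not survive even the trip through Unicode.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class RoundTripDemo {
    // Assumption: these JDK charsets correspond to Firebird's WIN1252 and DOS850.
    static final Charset WIN1252 = Charset.forName("windows-1252");
    static final Charset DOS850 = Charset.forName("IBM850");

    // Decode the stored bytes to a Java String, then encode with the same charset.
    static byte[] viaUnicode(byte[] original, Charset dbCharset) {
        String text = new String(original, dbCharset);
        return text.getBytes(dbCharset);
    }

    // Transcode through a second 1-byte charset and back again.
    static byte[] viaOtherCharset(byte[] original, Charset from, Charset other) {
        String text = new String(original, from);
        byte[] inOther = text.getBytes(other);   // unmappable characters become '?'
        String back = new String(inOther, other);
        return back.getBytes(from);
    }

    public static void main(String[] args) {
        // "café €" as it would sit in a Win1252 column; 0x80 is the euro sign.
        byte[] stored = {0x63, 0x61, 0x66, (byte) 0xE9, 0x20, (byte) 0x80};

        // true: DB charset -> Unicode -> DB charset gives back the same bytes
        System.out.println(Arrays.equals(stored, viaUnicode(stored, WIN1252)));

        // false: DOS850 has no euro sign, so it is replaced and cannot be recovered
        System.out.println(Arrays.equals(stored, viaOtherCharset(stored, WIN1252, DOS850)));
    }
}
```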

> Also, personally I do not understand the problem, as you can always
> connect with UNICODE_FSS client encoding and, if I'm correct, all
> character sets will be correctly converted into that encoding. You might
> experience some performance loss, as all clients will ask server to do the
> conversion, but I do not see any other consequences of this decision. The
> only problem you will get with NONE encoding, but that's completely
> different issue.

I suppose this is true for the general case, and it'd probably even
work for my situation where I need to be able to reconstruct in the
client program the exact sequence of bytes that appeared in a
char/varchar column in some special cases. But why should the server
have to do the extra conversion when the driver can do it without the
extra step of going to UTF-8 before handing the data bytes off to the
driver and then to UCS2 inside a getString() call within the driver?
Performance is important: that's one of the main factors that drives
the choice of a DBMS. However, with your suggested approach, if I need
to deal with the bytes as they appeared on the DB, knowing the DB's
native charset, I can get it back because there's nothing lost in the
change from a 1-byte charset to UTF-8 that wouldn't be lost going
directly to UCS2--so, functionally, this isn't a bad approach for me.
I'd read the codepage from the DB and simply use Java conversions to
switch back to the appropriate byte string. And, if I'm just a total
odd-ball here, I suppose I may have to live with the slight
performance hit.
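
A rough sketch of that fallback (connect with UNICODE_FSS, read the DB's charset name, then use Java conversions to rebuild the byte string) might look like the following. The class, the method names, and the small name mapping are assumptions for illustration only, not part of JayBird. In practice the charset name would come from the `select rdb$character_set_name from rdb$database` query, and since that column is a padded CHAR the value is trimmed first.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Map;

public class NativeBytes {
    // Hypothetical, partial mapping from Firebird charset names to Java charset names.
    // UNICODE_FSS is treated as UTF-8 here, which is an approximation.
    static final Map<String, String> FB_TO_JAVA = Map.of(
            "WIN1252", "windows-1252",
            "DOS850", "IBM850",
            "ISO8859_1", "ISO-8859-1",
            "UNICODE_FSS", "UTF-8");

    // Resolve the name read from rdb$database to a Java Charset.
    static Charset javaCharsetFor(String firebirdName) {
        String key = firebirdName.trim().toUpperCase(Locale.ROOT);
        String javaName = FB_TO_JAVA.get(key);
        if (javaName == null) {
            throw new IllegalArgumentException("No mapping for charset " + key);
        }
        return Charset.forName(javaName);
    }

    // Rebuild the bytes as they would appear in a column stored in the DB's charset.
    static byte[] nativeBytes(String value, String firebirdName) {
        return value.getBytes(javaCharsetFor(firebirdName));
    }

    public static void main(String[] args) {
        // The String would normally come from getString() on a result set.
        byte[] bytes = nativeBytes("caf\u00e9", "WIN1252   ");
        for (byte b : bytes) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

This only works when every character in the String exists in the DB's charset, which is the case when the data originated in that charset.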

The proposal to allow the client and server to use the server's
charset may indeed not have been floated before partly because most
legacy apps used 1-byte charsets internally and those charsets were
dictated by the environment the app was running in (generally the
native codepage of the OS). With Java, as you pointed out, the
question changes. But that change is parallel to changes in current
OS's (which are about as likely to use Unicode internally as a 1-byte
charset nowadays). Since Java has UCS2 internally and can be installed
to handle the conversions for you, the obvious answer (as I see it) is
to allow the client to tell the server not to bother with conversions,
but still be aware of what the native codepage is so that it can give
the data back to the server over the wire in the server's native codepage.

Applications and DB's that I work with (that I plan to use Firebird in
conjunction with) include information that's encoded in admittedly
unusual and tricky codepages. The only way to reliably handle them is
to treat the text as a byte stream and handle the conversion to and
from UCS2 with an add-on package (Java's not likely to handle the ANSEL
codepage, let alone the two alternate encodings it has). So, the
data's tricky, but it's mostly just plain text and that's how it's
stored on the source DB. To manipulate it, in many cases, I have to
convert it back to the bytes that were stored in the table cell on the
DB and then use an add-on package to do a reliable conversion to
Unicode (naturally, with this performance hit, I'm not anxious to have
others ;-). If I want to copy such data to Firebird, I have to be able
to guarantee round-trip conversion to the specific bytes that were on
the original database (or use your approach which involves the extra
conversion to UTF-8 and just live with it). Any time there's a need to
get the bytes that were on the DB back, you need to know what the
original data encoding was so you can explicitly do that.
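
As an illustration of the "treat it as a byte stream and convert with an add-on" idea, here is a toy table-driven codec. Everything in it is hypothetical: the class name, the single override, and the choice of ISO-8859-1 as the base table are placeholders. It does not model ANSEL, which is considerably more involved. The point is only that a one-to-one table guarantees the original bytes can always be rebuilt.

```java
import java.util.HashMap;
import java.util.Map;

public class TableCodec {
    private final char[] toUnicode = new char[256];
    private final Map<Character, Byte> toByte = new HashMap<>();

    // Starts from ISO-8859-1 (byte value == code point) and applies the overrides.
    TableCodec(Map<Integer, Character> overrides) {
        for (int b = 0; b < 256; b++) {
            toUnicode[b] = (char) b;
        }
        for (Map.Entry<Integer, Character> e : overrides.entrySet()) {
            toUnicode[e.getKey()] = e.getValue();
        }
        // Build the reverse table and insist that the mapping is one-to-one.
        for (int b = 0; b < 256; b++) {
            if (toByte.put(toUnicode[b], (byte) b) != null) {
                throw new IllegalArgumentException("Table is not one-to-one");
            }
        }
    }

    // Bytes as stored in the column -> Java String.
    String decode(byte[] raw) {
        StringBuilder sb = new StringBuilder(raw.length);
        for (byte b : raw) {
            sb.append(toUnicode[b & 0xFF]);
        }
        return sb.toString();
    }

    // Java String -> the exact bytes for the column.
    byte[] encode(String text) {
        byte[] out = new byte[text.length()];
        for (int i = 0; i < text.length(); i++) {
            Byte b = toByte.get(text.charAt(i));
            if (b == null) {
                throw new IllegalArgumentException("Unmappable character at index " + i);
            }
            out[i] = b;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical override: byte 0xA4 means the euro sign in this made-up codepage.
        TableCodec codec = new TableCodec(Map.of(0xA4, '\u20AC'));
        byte[] raw = {0x41, (byte) 0xA4};   // would come from getBytes() on the column
        String text = codec.decode(raw);
        System.out.println(java.util.Arrays.equals(raw, codec.encode(text)));
    }
}
```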

To sum up: I recognize that this is something that likely hasn't come
up before. But I don't see that as a reason to say it's not worth
doing. The fact of the matter is that it's a very clean solution to
dealing with alternative encodings in Java (and yes, that's pretty
much all I care about at the moment, but this also applies to a
growing number of other environments as well). And, I wouldn't propose
that a client be forced to use the DB's native charset, but that it
could do so easily without a lot of fuss if that were effective. It's
just a very clean solution to the whole question of how you connect to
the database: if you don't specify an encoding, the driver still gives
you UCS2 when you do a getString(), but you are guaranteed round-trip
compatibility and you only have one conversion. For most folks, it's
an esoteric concept that you can inquire of the DB to determine what
its native codepage is and recreate the exact byte sequence that was
on the database in a char/varchar column if you have occasion to need
to do it. For some of us, it's quite important. But, one thing it
would do is relegate the discussion of charsets to something you
normally don't have to worry about at all (I'm thinking of
documentation and FAQs here).

(Oh, and by the way: with that difference in value from the
select rdb$character_set_name from rdb$database
query: I was SURE it was the same DB--I was just wrong. So now I need
to figure out what's going on with this other DB which seems to think
it's Win1252 according to IBAccess but has UNICODE_FSS in the
rdb$database.rdb$character_set_name column. So, that one was just my
own private oops; as you correctly noted.)

Thanks for your time!

John