firebird-java - Re: Approaches at JB-to-FB conenctions regarding charsets

Subject	Re: Approaches at JB-to-FB conenctions regarding charsets
Author	the_a_rioch
Post date	2012-07-03T09:39:18Z

> > P1: do two-phase connection to query database default charset (or even set of all used charsets), choose charset broad enough to support all that data (win1251 or UTF8 or whatever). Reconnect.
> >
> > Drawback: on long lines where TCP connection takes long time - that would double the delay.
> > Answer: Those who are really affected, would learn to fix charset in JDBC URL. Those who are not would get best possible charset at price of barely noticeable delay.
>
> Because of the performance implications this is in my opinion not a good
> option.

Are there some statistics, how long takes establishing a connection and hence what would be the penalty for two-phase mode ?
Also, connecting only takes a small share of overall connection time. Even adding few seconds to this would not affect established connection speed.

Also i remember i made a modified P1bis:

a. wire protocol is extended to query/specify charset after connection is made
b. if client connects without specifying charset, then connection is "restricted"
c. in restricted mode client can
c.1: specify choosen connection, switching connection to full mode
c.2: ask server which charset it considers broad enough (least common charset)
c.3: read metadata, if client wants to choose a charset w/o server hints (say, want to reduce network load and choose some SBCS): only RDB$... tables and views, only 7-bit ASCII strings and numbers, only read, no side-effects - no UFDs nor SPs,

This would need both client and server orchestrated update.
This would add single extra after-connect roundtrip, negotiating a charset, which is comparable to your proposal of covertly converting to DB default charset (which still would be needed to query)

> > P2: do try UTF8 connection. If it happened to be FB1.x w/o UTF8 support - re-connect with fallback to some connection wit hall those problems.

> if somebody has been using a characterset NONE databases with connection
> characterset NONE all these years
> you can be blissfully unaware of this situation and everything
> will just seem to work in almost all cases

until one not so lucky day...

if one is playing game of chance with db consistency, it is not bad thing to enforce safety rules.
the earlier they are enforced, the less data are at stake of damage

> If we would default to connect with UTF-8 either you get all kinds of
> transliteration errors or data is now stored in two different encodings:
> in other words you introduce logical data corruption.

Yes, transliteration errors like http://www.trackstudio.ru/forum/download/file.php?id=212
User would see it, panic and call HelpDesk instead of working noe the less - so, no data corruption. Until admins would mend the connection mode (even at least ugly and dirty fixing ?encoding=NONE) db would not be usable for humans,so they would not enter new data.

> I assume that with SBCS you mean single byte character sets?

sure

> > P3: do whatever charset assumption you wish. whatever. absolutely. Just report it to server.

> The problem is that when the database charset and local system encoding
> are different when using NONE, in most cases you won't be able to detect
> the logical data corruption that will occur when Jaybird changes its
> behavior, so Jaybird will have no way to fail early.

To me it looks quite opposite.
http://www.trackstudio.ru/forum/download/file.php?id=212
Users would instantly detect and "cry early"

>
> > P4: deny connections without explicitly specified charset. If NONE is specified explicitly - do NO conversion at all, as told in FB docs.

no critical comments from you

> This is also the reason why I propose to have Jaybird use the database
> characterset for encoding/decoding when connecting with NONE / or no
> explicit character set

But you do not want to do it in two-phase way. Instead you want to deviate from negotiated connection status and do some black magic behind the scenes.
Also you want to treat same way two different situations, when user is unaware of encodings and does not specify any, and when user is aware of encodings and purposedly fixes NONE.

I think #2 is but ignoring user's will.
I think #1 is patching exceptional non-straight case with even more exceptional one. More and more gotcha's.

in MySQL3 one could INSERT INTO T(PK, ...).... VALUES (NULL, ....) meaning that server should auto-inc PK for him. Nice corner cutting for simple web PHP services.
imagine that later MySQL would not abandon that shortcut, but instead introduced one more NULL or one more INSERT. Would you consider that fix or making things even more unexpected for standard SQL users and even less compatible?

To address deviation with yet another deviation makes the overall structure less predictable and more fragile.

For example, now the behaviour would critically depend upon installed client library version. All your arguments against P2 and P3 are applied to such a change, but in a worst form: now server has no methods to alert of impossible transliteration required. In P2 there was no such thing, UTF-8 could absorb any charset, and in P3 server could detect incompatible charsets and throw. This your proposal strips server of this ability while keeping the situation possible.

> For example:
> You have a WIN1252 database (or any other 1-byte characterset excluding
> ASCII).

Hmmm... if i encoding-ignorant admin, then it more probably would be created with NONE charset generally set.

> Now assume your Java application is running on a system with local
> system encoding WIN1251 and you connect without explicit characterset

All my Order By and case-ignoring text searches are doomed :-)
Frankly, i can hardly imagine any reach database running without server-side sorting at all. But okay, let's go with that example.

...and then i get Mac or Linux box to the network, with the same Java app but different "default" charset for Russian.
Uh-oh....

I still don't know what would happen on Linux. TS claim to only use UTF-8 Linux, yet they believed that KOI8-R is only charset FB supports for Russian. If that was not wrong russian charset was taken by default for Russian, during server installation or Java installation or whatever, then i can't imagine how that could be.

> or with NONE.

Different usecase. I proclaimed my will to do low-level surger, disable safety regulations and shoot my feet. Quite different.

> This will make that application see and use the database *as
> if* it is a WIN1251 database without any problems.

Except for absolutely screwed sorting and partially text search
I wonder if i would send then explicit statement like "EXECUTE BLOCK" with russian text constants inside, what would be...

If i would create new table then (software version updating and adjusting schema), would it inherit WIN1252 of database or WIN1251 of de-facto connection mode or NONE of de jure connection mode...

> Changing the way no explicit or NONE characterset works in Jaybird will
> break that application.

In a way that users would see garbage, cease operations and request admin to fix it.
Admins either does it ASAP or rolls back upgrade and schedules it for later.
Anyway, now both users and admins are aware they are in danger zone.

> > P1 and P2 are my favorite options.
> > P3 is least work.
> > P4 is most rigid and enforcing of safe practices.
> >
> > I consider absolutely impossible state when server and client have DIFFERENT expectations about transcoding maps and their engaging or bypassing. That is data loss. Today or tomorrow. That is really to be avoided at all costs. Better slow than halted, better halted than messed.

> Well, unfortunately that is the state we are in. I agree it is a bit
> annoying, but it is simply a situation which exists and has existed for
> at least 10 years in the case of Jaybird. We can not simply throw away
> compatibility for programs (or break them) which are currently running
> without problems with 'wrong' connection charactersets.

The fix would only require JDBC URL change if database is consistent still.
If it is not consistent already (like in ur example above) then it should be fixed.
And if mending URL is too hard for them - there is always JB 2.1.6

> running blissfully unaware of their incorrect use

I can't consider it a benefit.

They are not just "running", they are storing more and more of potentially inconsistent data.

> of charactersets
> without problems (granted they will also break if the local system
> encoding would change).

or anything else, we even can imagine today.

your proposal to report server NONE but use a different encoding, this time taken from DB rather than from local OS, would break them in the same way.

Now, i think that clients are expected to behave similar to each other, since they are all official.
So let's imagine your proposal was ported to .NetProvider and fbclient/fbembed.
Now imagine someone configured Jaybuird to use fbclient/fbembed.
He does not specify connection charset, so fbclient/fbembed does covert transcoding while pretending they use NONE, then Jaybird seconds and makes yet another transcoding while pretending to use NONE.
"Thames, ^W
"Double conversion, sire!"

> The funny thing is, your problem with TrackStudio probably wouldn't have
> occurred if they hadn't mentioned that you needed to create it as a UTF8
> database.

And they had not. But they mentioned UTF8 in other parts of the program, and i wanted DB to match.

They just did not know Firebird had those charsets!
It was critical info for non-English reliability, and it passed behind the radar. Critical info is moved into Appendix D, as if Appendices and fineprints are the 1st thing to read.

I can suggest one more scenario, they document DB creation using some GUI tool, which has a field for charset. Let it be FlameRobin/Linux and KOI8-R they told about. Now some Windows user takes FlameRobin/Win and replicates the screenshot. What do we have ? inconsistent database claiming it is KOI8-R but having Win1251 texts stored.

> It probably would simply have worked if the database had been
> created with database characterset NONE ;)

No, i would not want THAT to work.

That is not desktop application, that is a www service, that is supposed to be moved here and there. I would fiill it, set it up, then i would move it to unix server and it all breaks up ?
MySQL way. Do not want.