Subject Re: [Firebird-Java] Approaches at JB-to-FB connections regarding charsets
Author Roman Rokytskyy
> NONE means "no transcoding": pass binary data unchanged.
> It means "there is NO language context".
> It means "that is supposed to be raw binary data, and the client takes all responsibility for finding text in it, if there is any at all".
>
> That is what the FB server docs say. Overriding the server docs in a client library is a gotcha. Client libraries are generally expected not to override server rules. If the server says "NONE is no-transcode", the same is expected from the library.

That might work in C++, where you have access to the raw data. In Java
the data has to be converted to Unicode, so the JDBC driver has to make
some assumptions. Period.
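
To illustrate the point: a Java String holds UTF-16 code units, so bytes
coming off the wire only become text once some charset is applied. A
minimal sketch (class name and sample bytes are made up for illustration)
where the same four bytes decode to two different strings:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RawBytesDemo {
    // Four bytes as a NONE column might deliver them (sample data, not from a real database).
    static final byte[] RAW = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8, (byte) 0xE2};

    // Turning bytes into a String always requires choosing a charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // windows-1251 ships with standard JDKs but is not one of the guaranteed charsets.
        System.out.println(decode(RAW, Charset.forName("windows-1251"))); // Cyrillic letters
        System.out.println(decode(RAW, StandardCharsets.ISO_8859_1));     // accented Latin letters
    }
}
```

Neither result is "the raw data"; each is one interpretation of it.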

> P1: do a two-phase connection: first query the database default charset (or even the set of all charsets used), choose a charset broad enough to support all that data (WIN1251, UTF8 or whatever), then reconnect.
>
> Drawback: on slow links, where establishing a TCP connection takes a long time, this would double the delay.
> Answer: Those who are really affected would learn to fix the charset in the JDBC URL. Those who are not would get the best possible charset at the price of a barely noticeable delay.

Possible, but complicates the driver significantly.
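
For reference, this is roughly what P1 would look like if done in
application code on top of plain JDBC rather than inside the driver. It
is only a sketch under assumptions: the class and method names are
hypothetical, the URL is expected to have no query part yet, and it looks
at the database default charset only, not at per-column charsets.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

final class TwoPhaseConnect {
    // Appends the "encoding" connection parameter to a URL without a query part.
    static String withEncoding(String baseUrl, String firebirdCharset) {
        return baseUrl + "?encoding=" + firebirdCharset;
    }

    // Phase 1: ask the database for its default character set.
    static String defaultCharset(Connection probe) throws SQLException {
        try (Statement st = probe.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT RDB$CHARACTER_SET_NAME FROM RDB$DATABASE")) {
            String name = rs.next() ? rs.getString(1) : null;
            // The column is a padded CHAR; NULL means no default charset was set.
            return name == null ? "NONE" : name.trim();
        }
    }

    // Phase 2: reconnect using the charset found in phase 1.
    static Connection connect(String baseUrl, String user, String password)
            throws SQLException {
        String charset;
        try (Connection probe = DriverManager.getConnection(
                withEncoding(baseUrl, "NONE"), user, password)) {
            charset = defaultCharset(probe);
        }
        return DriverManager.getConnection(withEncoding(baseUrl, charset), user, password);
    }
}
```

Note that if the database default is itself NONE, the second connection
gains nothing, and every connect pays for two handshakes.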

> P2: try a UTF8 connection first. If the server turns out to be FB1.x without UTF8 support, reconnect with a fallback to some other connection, with all those problems.
>
> Drawback 1: those with FB1.x would get a potentially slow two-phase connection.
> Answer 1: They can fix it by specifying the encoding in the URL. That would push them toward safer practice. It would also push them to upgrade to FB2 and lessen support efforts.
>
> Drawback 2: it might inflate data sent over the network up to four times compared with SBCS.
> Answer 2: Those who are really affected and can use SBCS would learn to fix an SBCS charset in the JDBC URL. Those who are really affected and cannot use SBCS have little choice anyway, maybe three-byte UNICODE_FSS if anything at all. Those who are not affected would get a safe, reliable charset at the price of higher equipment load, which I think is bearable.

Works fine for all databases with a specified charset (database or
column). Some network overhead for Russians, but who cares, and possibly
some CPU overhead for the server.
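
To put a number on that overhead: Cyrillic letters take one byte each in
WIN1251 and two bytes each in UTF-8. A small sketch (class name and sample
string are illustrative only) that measures the encoded payload:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WireSizeDemo {
    // Number of bytes the text occupies once encoded in the given charset.
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String cyrillic = "\u041F\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        System.out.println(encodedLength(cyrillic, Charset.forName("windows-1251"))); // prints 6
        System.out.println(encodedLength(cyrillic, StandardCharsets.UTF_8));          // prints 12
    }
}
```

Plain ASCII text costs the same in both, so the doubling only hits the
non-Latin part of the data.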

> P3: make whatever charset assumption you wish. Whatever. Absolutely. Just report it to the server. If you are going to transcode WIN1251<->UTF16, then tell the server you expect WIN1251 data over the connection, so that the server can adapt and check that its data fits into WIN1251. If you are going to transcode KOI8-R<->UTF16, then open a KOI8-R connection so that the server can adapt. If you stick with 7-bit ASCII aka LATIN1 then... etc.
>
> Drawbacks: already inconsistent legacy databases that used to work by concealing their inconsistency would fail and force the administrator to mend them.
> Answer: "Crash early" policy. If the data is inconsistent, the earlier it is fixed, the lower the chance of data loss. Or at least the less data is lost.

It does not differ from the UTF8 case, except that you explicitly limit
the characters you can consume. If the data is inconsistent, you get an
error, which is OK.
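
The Java side has the same kind of "error instead of silent damage"
behaviour if you ask for it. A minimal sketch (the class and method names
are hypothetical; this is the client-side analogue, not what the server
does internally) that reports characters a limited charset cannot hold:

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

final class StrictEncodeDemo {
    // Returns true only if every character of text can be encoded in the charset.
    static boolean fits(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            return true;
        } catch (CharacterCodingException e) {
            // Raised instead of silently substituting '?' for the character.
            return false;
        }
    }
}
```

For example, Cyrillic text fits into windows-1251 while a Chinese
character does not; with UTF-8 both fit.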


> P4: deny connections without an explicitly specified charset. If NONE is specified explicitly, do NO conversion at all, as stated in the FB docs.
>
> Drawback: some databases would stop working until admins read the manuals and learn to specify it.
> Answer: If current practice is fragile and unreliable, it is a good thing to enforce a change of practice, before data is actually lost due to an environment change and you are blamed not for a temporary outage but for a complete disaster.

Again, "do no conversion" is not possible in Java. But Jaybird already
has an answer to this: there is another property which tells it how to
interpret NONE data.
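
A sketch of how that looks in connection properties, assuming the
property meant here is Jaybird's "charSet" (a Java charset name) used
next to "encoding" (the Firebird charset name). The helper class is
hypothetical; check the exact property names against the documentation
of your Jaybird version.

```java
import java.util.Properties;

final class NoneCharsetProps {
    // Builds properties for a NONE connection whose bytes are decoded with javaCharset.
    static Properties noneWithJavaCharset(String user, String password, String javaCharset) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("encoding", "NONE");     // Firebird connection charset
        props.setProperty("charSet", javaCharset); // assumed: Java charset applied to NONE data
        return props;
    }
}
```

The result would be passed to DriverManager.getConnection(url, props).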

> Post-condition: if NONE is still treated as "transcode, guessing by local OS and local language", then it would be good to introduce a RAW-BINARY charset, meaning a complete 100% ban on any transcoding, ever. Data from the FDB file should pass to the end developer's code and back without a single bit changed. Expect the developer to use this only in extreme conditions, when he really needs it. Low-level surgery.

There is no low-level surgery for Strings in Java. Stop this "no
transcoding" rant.

> P1 and P2 are my favorite options.
> P3 is the least work.
> P4 is the most rigid and best enforces safe practices.
>
> I consider it absolutely unacceptable for server and client to have DIFFERENT expectations about transcoding maps and whether they are engaged or bypassed. That is data loss, today or tomorrow. That really is to be avoided at all costs. Better slow than halted, better halted than messed up.
>
> If the server expects the client not to transcode data, it should be so.
> If the client wants to transcode data from WIN1251, the server should be informed.
>
> Did I express my opinion in enough detail and with enough contrast this time?

None of your suggestions will work in a heterogeneous environment. Even
if Jaybird defaults to UTF8, all Delphi apps will default to NONE, and
they won't be able to read those UTF-8 characters. And when they write
data, Java apps won't be able to read it.
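
The effect can be simulated without any database: one client stores
bytes under one assumption, another reads them under a different one. A
sketch with hypothetical names, where WIN1251 stands in for whatever a
NONE client happens to use locally:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MixedClientsDemo {
    static final Charset WIN1251 = Charset.forName("windows-1251");

    // Text stored by a writer using one charset, then read by a client assuming another.
    static String readAs(String written, Charset writerCharset, Charset readerCharset) {
        return new String(written.getBytes(writerCharset), readerCharset);
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // six Cyrillic letters
        // Java client wrote UTF-8, NONE client reads the bytes as WIN1251: mojibake.
        System.out.println(readAs(text, StandardCharsets.UTF_8, WIN1251));
        // NONE client wrote WIN1251 bytes, Java client decodes them as UTF-8:
        // malformed sequences are replaced with U+FFFD.
        System.out.println(readAs(text, WIN1251, StandardCharsets.UTF_8));
    }
}
```

In the second direction the original bytes cannot be recovered from the
resulting String, so writing it back makes the damage permanent.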

The only solution is to ditch NONE completely, leaving it for
backward compatibility only.

Roman