Subject Re: [Firebird-Java] Approaches at JB-to-FB connections regarding charsets
Author Mark Rotteveel
On 29-6-2012 13:53, the_a_rioch wrote:
> Pre-condition.
> Do not treat NULL as if it were zero.
> Do not treat an unspecified charset as if it were a charset explicitly set to NONE. That is not the same thing.
>
> Now repeating proposals:
>
> P1: do a two-phase connection: first query the database default charset (or even the set of all charsets in use), choose a charset broad enough to support all that data (WIN1251 or UTF8 or whatever), then reconnect.
>
> Drawback: on slow links, where establishing a TCP connection takes a long time, that would double the delay.
> Answer: those who are really affected would learn to fix the charset in the JDBC URL. Those who are not would get the best possible charset at the price of a barely noticeable delay.

Because of the performance implications this is in my opinion not a good
option.
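For those who do want to pin it down, fixing the character set in the JDBC URL is straightforward. A sketch (Jaybird accepts the Firebird character set name via the `encoding` property and the Java name via `charSet`; host, port and database path below are placeholders):

```
jdbc:firebirdsql://localhost:3050/employee?encoding=WIN1251
jdbc:firebirdsql://localhost:3050/employee?charSet=windows-1251
```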

> P2: try a UTF8 connection first. If the server happens to be FB 1.x without UTF8 support, reconnect with a fallback charset, with all the problems that entails.
>
> Drawback 1: those with FB 1.x would get a potentially slow two-phase connection.
> Answer 1: they can fix it by specifying the encoding in the URL. That would push them toward a safer practice. It would also encourage them to upgrade to FB 2 and lessen support efforts.

As we are dropping explicit support for Firebird versions 1.0 and 1.5 in
Jaybird 2.3, this is not really a big problem. However, a big drawback
remains if somebody has been using a characterset NONE database with
connection characterset NONE all these years (or actually (almost) any
database characterset combined with connection characterset NONE). In
the current situation you can be blissfully unaware of this and
everything will just seem to work in almost all cases (until your local
system encoding changes).
If we were to default to connecting with UTF-8, you would either get all
kinds of transliteration errors, or data would now be stored in two
different encodings: in other words, you introduce logical data
corruption.
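To make the "two different encodings" failure mode concrete, here is a minimal, self-contained Java sketch (plain JDK, no Jaybird involved; the class and method names are mine, for illustration only). The same text yields different byte sequences under Cp1252 and UTF-8, so a column written partly through a NONE connection and partly through a UTF-8 connection can no longer be decoded consistently:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MixedEncodingDemo {

    // Encode a string under a named character set.
    public static byte[] encode(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Same text, two different byte representations:
        byte[] asCp1252 = encode("café", "windows-1252"); // 4 bytes, é = 0xE9
        byte[] asUtf8   = encode("café", "UTF-8");        // 5 bytes, é = 0xC3 0xA9

        // Reading the Cp1252 bytes back as UTF-8 does not round-trip:
        // 0xE9 is not a valid UTF-8 sequence and decodes to U+FFFD.
        String misread = new String(asCp1252, StandardCharsets.UTF_8);
        System.out.println(misread.equals("café")); // prints false
    }
}
```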

> Drawback 2: it might inflate the data sent over the net up to four times compared with an SBCS.
> Answer 2: those who are really affected and can use an SBCS would learn to fix the SBCS charset in the JDBC URL. Those who are really affected and cannot use an SBCS have little choice anyway, maybe three-byte UNICODE_FSS if anything at all. Those who are not affected would get a safe, reliable charset at the price of higher equipment load, which I think is bearable.

I assume that with SBCS you mean single byte character sets?
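If so: the inflation is real, though its magnitude depends on the script. A quick JDK-only sketch (class and method names are mine) showing that Cyrillic text doubles in size under UTF-8 compared with WIN1251; on top of that, Firebird's UTF8 character set reserves up to 4 bytes per character for column storage:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingSizeDemo {

    public static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String cyrillic = "Привет";
        // Single-byte character set: one byte per character.
        System.out.println(byteLength(cyrillic, Charset.forName("windows-1251"))); // 6
        // UTF-8 uses two bytes for each Cyrillic character.
        System.out.println(byteLength(cyrillic, StandardCharsets.UTF_8)); // 12
    }
}
```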

> P3: make whatever charset assumption you wish, absolutely any, just report it to the server. If you are going to transcode WIN1251<->UTF16, then tell the server you expect WIN1251 data over the connection, so that the server can adapt and apply safety checks that its data fits into WIN1251. If you are going to transcode KOI8-R<->UTF16, then open a KOI8-R connection so that the server can adapt. If you stick with 7-bit ASCII, then... etc.
>
> Drawbacks: already-inconsistent legacy databases, which used to work by concealing their inconsistency, would stop working and ask the administrator to mend them.
> Answer: a "crash early" policy. If it is inconsistent, the earlier it is fixed, the smaller the chance of data loss, or at least the smaller the amount of data lost.

The problem is that when the database charset and local system encoding
are different when using NONE, in most cases you won't be able to detect
the logical data corruption that will occur when Jaybird changes its
behavior, so Jaybird will have no way to fail early.
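Why detection is impossible can be shown in a few lines of plain Java (class and method names are mine, for illustration): between two single-byte character sets, every byte decodes to *some* character, so decoding with the wrong one raises no error and produces no replacement character, only silently different text:

```java
import java.nio.charset.Charset;

public class SilentMojibake {

    // Simulate Cyrillic text stored through a NONE connection from a
    // WIN1251 system, then read back on a WIN1252 system.
    public static String misdecode(String original) {
        byte[] stored = original.getBytes(Charset.forName("windows-1251"));
        // No exception is thrown here: every one of these bytes is a
        // valid WIN1252 character, so the corruption goes unnoticed.
        return new String(stored, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        String misread = misdecode("Привет");
        System.out.println(misread);                   // prints Ïðèâåò
        System.out.println(misread.equals("Привет"));  // prints false
    }
}
```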

> P4: deny connections without an explicitly specified charset. If NONE is specified explicitly, do NO conversion at all, as described in the FB docs.
>
> Drawback: some databases would stop working until admins read the manuals and learn to specify it.
> Answer: if the current practice is fragile and unreliable, it is a good thing to force a change of practice, before data is actually lost due to an environment change and you are blamed not for a temporary outage but for a complete disaster.
>
> Post-condition: if NONE would still be treated as "transcode, guessing by local OS and local language", then it would be good to introduce a RAW-BINARY charset, meaning a complete, 100% ban on any transcoding, ever. Data from the FDB file should pass to the end developer's code and back without a single bit changed. Expect the developer to use this only in the extreme conditions where he needs it: low-level surgery.

Not a real option because, as I said in my previous mail, you have to
provide Strings for CHAR and VARCHAR in JDBC, and therefore you have to
select an encoding. That is simply the way Java works: there is no such
thing as a raw string (and if you had to consider such a thing to exist,
that would be the case where you use the local system encoding, which is
exactly what Jaybird is doing now).

This is also the reason why I propose to have Jaybird use the database
characterset for encoding/decoding when connecting with NONE or without
an explicit character set, as that is the only viable way to get it to
work in Java. Unfortunately, even that still has the potential for
logical data corruption.

For example: you have a WIN1252 database (or a database with any other
1-byte characterset, excluding ASCII). Now assume your Java application
is running on a system with local system encoding WIN1251 and you
connect without an explicit characterset, or with NONE. The application
will then see and use the database *as if* it were a WIN1251 database,
without any problems.

Changing the way no explicit or NONE characterset works in Jaybird will
break that application.

> P1 and P2 are my favorite options.
> P3 is least work.
> P4 is most rigid and enforcing of safe practices.
>
> I consider it an absolutely intolerable state when server and client have DIFFERENT expectations about the transcoding maps and whether they are engaged or bypassed. That is data loss, today or tomorrow, and it really must be avoided at all costs. Better slow than halted; better halted than messed up.

Well, unfortunately that is the state we are in. I agree it is a bit
annoying, but it is simply a situation which exists, and has existed for
at least 10 years in the case of Jaybird. We cannot simply throw away
compatibility for (or break) programs which are currently running
without problems with 'wrong' connection charactersets.

> You also claimed that my proposal would badly break many real apps out there. I was interested to know more. I asked you for a practical example of such an app; I was thrilled to learn what practical use cases have to be taken into account when planning charset-related behavior.

See above. Changing the behavior can break applications which are now
running, blissfully unaware of their incorrect use of charactersets,
without problems (granted, they will also break if the local system
encoding changes).

The funny thing is, your problem with TrackStudio probably wouldn't have
occurred if they hadn't mentioned that you needed to create it as a UTF8
database. It probably would simply have worked if the database had been
created with database characterset NONE ;)

Mark
--
Mark Rotteveel