firebird-java - Approaches to JB-to-FB connections regarding charsets

Subject	Approaches to JB-to-FB connections regarding charsets
Author	the_a_rioch
Post date	2012-06-29T11:53:24Z

> > from #JDBC-257
> > "The connection characterset would remain NONE, but the encoding used in FBStringField would change to match the database"
> >
> > Ehh? Same trick again ? Report de jure one charset but de facto use different one ?
> > You gonna cheat again? You gonna lie to server ?
> > And you think that would - in the long term - end well ?
>
> You are aware that this is exactly the same way that Firebird server
> works.

> If the connection characterset is NONE, Firebird will send the
> content of CHAR and VARCHAR in the encoding of the database AS IS,

yes

> So matching the encoding of the
> database would ensure the exact same representation in String

no

NONE means "no transcoding": pass binary data unchanged.
It means "there is NO language context".
It means "this is supposed to be raw binary data, and the client takes all responsibility for finding text in it, if there is any at all".

That is what the FB server docs say. Overriding the server docs in a client library is a gotcha. Client libraries are generally expected not to override server rules. If the server says "NONE is no-transcode", the same is expected from the library.
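
To make the "no language context" point concrete, here is a minimal Java sketch. It uses the plain JDK only, no Jaybird involved, and the class and method names are made up for illustration. The same bytes, as they would arrive over a NONE connection, become different text depending on which charset the client guesses:

```java
import java.nio.charset.Charset;

public class NoneCharsetDemo {
    // Hypothetical sample text (Russian "hello"), written with Unicode
    // escapes so that the source file encoding does not matter.
    public static final String TEXT = "\u041F\u0440\u0438\u0432\u0435\u0442";

    public static final Charset WIN1251 = Charset.forName("windows-1251");
    public static final Charset KOI8R = Charset.forName("KOI8-R");

    // Interpret raw bytes under an assumed charset, which is what a client
    // has to do when the connection itself carries no charset information.
    public static String decode(byte[] raw, Charset assumed) {
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        byte[] raw = TEXT.getBytes(WIN1251); // what a WIN1251 column would hold
        System.out.println(decode(raw, WIN1251).equals(TEXT)); // right guess
        System.out.println(decode(raw, KOI8R).equals(TEXT));   // wrong guess
    }
}
```

The first line printed is true, the second is false: nothing in the bytes themselves says which guess is right.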

> on the Java side as in the database which in my mind
> would be the best way to handle it.

If you can handle it.

In my practical case, that left the database unusable.

> If you have a better idea, make a detailed proposal

I did already. Two times. Or even three. Oh, no, even four times!

Pre-conditions.
Do not treat NULL as zero.
Do not treat an unspecified charset as the same as a charset explicitly set to NONE. They are not the same.

Now repeating proposals:

P1: do a two-phase connection. First query the database default charset (or even the set of all charsets used), then choose a charset broad enough to support all that data (WIN1251 or UTF8 or whatever) and reconnect.

Drawback: on slow links, where establishing a TCP connection takes a long time, that would double the delay.
Answer: Those who are really affected would learn to fix the charset in the JDBC URL. Those who are not would get the best possible charset at the price of a barely noticeable delay.
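
A rough sketch of what I mean by P1, in plain JDBC. Assumptions, loudly: the class and method names are hypothetical, the URL is assumed to have no query parameters yet, the charset is passed as an "encoding" URL parameter, and falling back to UTF8 when the database default is NULL or NONE is my own choice for the sketch. The probe only reads a metadata name, so the charset of the probe connection itself should not matter.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Locale;

public class TwoPhaseConnect {
    // Pick the connection charset from the database default charset.
    // NULL or NONE carries no usable information, so fall back to UTF8
    // (an assumption of this sketch, not a rule of the driver).
    public static String chooseEncoding(String dbDefaultCharset) {
        if (dbDefaultCharset == null) {
            return "UTF8";
        }
        // The system table column is CHAR, so it comes back space-padded.
        String name = dbDefaultCharset.trim().toUpperCase(Locale.ROOT);
        return (name.isEmpty() || name.equals("NONE")) ? "UTF8" : name;
    }

    // Phase 1: open a short-lived probe connection and read the default charset.
    public static String probeDefaultCharset(String url, String user, String password)
            throws SQLException {
        try (Connection probe = DriverManager.getConnection(url, user, password);
             Statement st = probe.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT RDB$CHARACTER_SET_NAME FROM RDB$DATABASE")) {
            return rs.next() ? rs.getString(1) : null;
        }
    }

    // Phase 2: reconnect, this time telling the server which charset we use.
    public static Connection connect(String url, String user, String password)
            throws SQLException {
        String encoding = chooseEncoding(probeDefaultCharset(url, user, password));
        return DriverManager.getConnection(url + "?encoding=" + encoding, user, password);
    }
}
```

Anyone who cannot afford the second connect simply puts the encoding into the URL and skips the probe.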

P2: try a UTF8 connection first. If the server turns out to be FB1.x without UTF8 support, reconnect with a fallback to some other connection charset, with all those problems.

Drawback 1: those with FB1.x would get a potentially slow two-phase connection.
Answer 1: They can fix it by specifying the encoding in the URL. That would push them toward safer practice. It would also push them to upgrade to FB2 and lessen support efforts.

Drawback 2: it might inflate the data sent over the net up to four times compared with SBCS.
Answer 2: Those who are really affected and can use SBCS would learn to fix an SBCS charset in the JDBC URL. Those who are really affected and cannot use SBCS have little choice anyway, maybe three-byte UNICODE_FSS if anything at all. Those who are not affected would get a safe, reliable charset at the price of higher equipment load, which I think is bearable.
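
P2 as a sketch. The Connector interface and the class name are hypothetical, introduced only so that the fallback loop can be shown without a live server. For brevity this version falls back on any SQLException; a real implementation should fall back only when the server rejects the charset, not on a wrong password or a dead link.

```java
import java.sql.SQLException;
import java.util.List;

public class Utf8FirstConnect {
    // Hypothetical abstraction over "open a connection with this encoding".
    public interface Connector<C> {
        C open(String encoding) throws SQLException;
    }

    // Try the candidate encodings in order and return the first connection
    // that opens. If none does, fail with the last error as the cause.
    public static <C> C connect(Connector<C> connector, List<String> encodings) {
        SQLException last = null;
        for (String encoding : encodings) {
            try {
                return connector.open(encoding);
            } catch (SQLException e) {
                last = e; // remember the failure and try the next candidate
            }
        }
        throw new IllegalStateException("no candidate encoding was accepted", last);
    }
}
```

In real use the connector would wrap something like DriverManager.getConnection(url + "?encoding=" + encoding, user, password), and the list would be UTF8 first, then whatever fallback is chosen.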

P3: make whatever charset assumption you wish. Whatever. Absolutely. Just report it to the server. If you are going to transcode WIN1251<->UTF16, then tell the server you expect WIN1251 data over the connection, so that the server can adapt and apply safety checks on whether its data fits into WIN1251. If you are going to transcode KOI8-R<->UTF16, then open a KOI8-R connection so that the server can adapt. If you stick with 7-bit ASCII or LATIN1 then... etc.

Drawbacks: already inconsistent legacy databases, which used to work by concealing their inconsistency, would fail to work and ask the administrator to mend them.
Answer: "Crash early" policy. If it is inconsistent, the earlier it is fixed, the lower the chance of data loss. Or at least the smaller the amount of lost data.
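
The whole of P3 is "one source of truth". A sketch under assumptions: the class name is hypothetical, the small Firebird-to-Java name table is only an illustrative subset, and "encoding" as the URL parameter is assumed. The point is that the charset the client transcodes with and the charset it announces to the server come from the same value, so they cannot drift apart.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

public class DeclaredEncoding {
    // Illustrative subset of Firebird charset names mapped to Java charset names.
    private static final Map<String, String> FB_TO_JAVA = new LinkedHashMap<>();
    static {
        FB_TO_JAVA.put("WIN1251", "windows-1251");
        FB_TO_JAVA.put("KOI8R", "KOI8-R");
        FB_TO_JAVA.put("ISO8859_1", "ISO-8859-1");
        FB_TO_JAVA.put("UTF8", "UTF-8");
    }

    private final String firebirdName;
    private final Charset javaCharset;

    public DeclaredEncoding(String firebirdName) {
        String javaName = FB_TO_JAVA.get(firebirdName);
        if (javaName == null) {
            // Crash early: no silent guessing for unknown or missing charsets.
            throw new IllegalArgumentException("unmapped charset: " + firebirdName);
        }
        this.firebirdName = firebirdName;
        this.javaCharset = Charset.forName(javaName);
    }

    // What the client tells the server.
    public String urlParameter() {
        return "encoding=" + firebirdName;
    }

    // What the client actually transcodes with, derived from the same name.
    public String decode(byte[] wire) {
        return new String(wire, javaCharset);
    }

    public byte[] encode(String text) {
        return text.getBytes(javaCharset);
    }
}
```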

P4: deny connections without an explicitly specified charset. If NONE is specified explicitly, do NO conversion at all, as the FB docs say.

Drawback: some databases would stop working until admins read the manuals and learn to specify it.
Answer: If the current practice is fragile and unreliable, it is a good thing to enforce a change of practice. Before data is actually lost due to an environment change, and you are blamed not for a temporary outage but for a complete disaster.
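
P4 as a sketch, which also shows the pre-condition that "not specified" and "explicitly NONE" are different things. The class name, the Mode enum and the "encoding" property key are assumptions made for illustration, not the driver's actual code.

```java
import java.util.Properties;

public class ExplicitCharsetPolicy {
    public enum Mode { RAW_BYTES, TRANSCODE }

    // Decide how to treat text data from the connection properties.
    // Unspecified charset: refuse to connect.
    // Explicit NONE:       pass bytes through untouched.
    // Anything else:       transcode with the declared charset.
    public static Mode resolve(Properties props) {
        String encoding = props.getProperty("encoding");
        if (encoding == null || encoding.trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "connection charset must be specified explicitly");
        }
        return encoding.trim().equalsIgnoreCase("NONE") ? Mode.RAW_BYTES : Mode.TRANSCODE;
    }
}
```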

Post-condition: if NONE is still to be treated as "transcode, guessing by local OS and local language", then a RAW-BINARY charset should be introduced, meaning a complete 100% ban on any transcoding ever. Data from the FDB file should pass to the end developer's code and back without a single bit changed. Expect the developer to use this only in extreme conditions, when he really needs it. Low-level surgery.
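
Why "without a single bit changed" rules out any trip through String: a small plain-JDK sketch (class and method names are made up). Bytes that are not valid in the charset the client guessed get replaced on decode and do not come back the same on encode.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawRoundTrip {
    // Decode bytes to a String and encode them back with the same charset.
    // Returns true only if the bytes survive the round trip unchanged.
    public static boolean survives(byte[] raw, Charset charset) {
        byte[] back = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, back);
    }

    public static void main(String[] args) {
        // Six bytes that spell a Russian word in WIN1251 but are malformed UTF-8.
        byte[] raw = {(byte) 0xCF, (byte) 0xF0, (byte) 0xE8,
                      (byte) 0xE2, (byte) 0xE5, (byte) 0xF2};
        System.out.println(survives(raw, StandardCharsets.UTF_8));           // wrong guess
        System.out.println(survives(raw, Charset.forName("windows-1251"))); // right guess
    }
}
```

This prints false, then true. So a raw-binary mode would have to hand the application byte[] (in JDBC terms something like ResultSet.getBytes) and never build a String on the way.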

P1 and P2 are my favorite options.
P3 is the least work.
P4 is the most rigid and enforces safe practices.

I consider absolutely unacceptable a state where server and client have DIFFERENT expectations about transcoding maps and about whether they are engaged or bypassed. That is data loss. Today or tomorrow. That is really to be avoided at all costs. Better slow than halted, better halted than messed up.

If the server expects the client not to transcode data, it should be so.
If the client wants to transcode data from WIN1251, the server should be informed.

Did I express my opinion in enough detail and with enough contrast this time?

> your way of ... criticizing design decisions
> that were taken over 10 years ago

> essentially telling us we are idiots

Choose one. There is no place for both statements at once.

If I am ranting, I may be ranting at the position taken 10 years ago, at those who took it then.
If I am ranting, I may be ranting at today's position of keeping that decision forever.

But don't mix them.

And since you took it so personally, I take it you are not talking about the '10 years ago legacy' but about your current position. Then why that InterClient argument?

You also claimed that my proposal would badly break many real apps out there. I was interested to know. I asked you for a practical example of such an app. I was eager to learn what practical use cases have to be taken into account when planning charset-related behavior.

Silence.
It seems that you have no such example at hand. It seems that when you claimed that my proposals would break those applications, as if it were a well-known fact, you only meant that my proposals could potentially harm some maybe-existing application. Should I think that you basically wrote my proposal off as one made "by an idiot"?

I do not take technical matters for personal insults here.

But when I have just been hurt by a situation where client and server, both under the same FB project umbrella, had different expectations, and I see a proposal to make the misunderstanding even deeper, not to fix it but to turn it into deliberate misinformation of the server by the client, I am really very scared.