Subject Re: [Firebird-Java] Re: Trying to run TrackStudio :-)
Author Mark Rotteveel
On 28-6-2012 17:07, the_a_rioch wrote:
>
>>> If Java apps always run in some pre-defined flavour of unicode, if the database is created in some standard charset (here - UTF8), why cannot JayBird query database and choose best matching default.
>>
>> The characterset has to be specified on connect.
>
> True. That might ask for two-step connect.
> Connect with NONE, read default charset. Reconnect.

This is not going to happen. For local connections this might be fine,
but for remote connections it has performance implications, as it will
take several extra roundtrips.

>> Also, the connection
>> characterset does not have to match the database characterset at all
>
> Sure. If user overrode connection charset - just obey.
> But if he omitted it, why not to choose best default ?

To be honest, I am not entirely happy with the way it works now. But the
thing is: using NONE as the connection characterset if none is specified
is probably the only sensible default there is.

>> (firebird server will translate between db charset and connection
>> charset; if possible).
>
> But - if to believe Kuzmenko - not for BLOBs, only for VARCHARs.
> Or - another thread at 2.0 times on russian Python forum - BLOB conversion was planned for some future, but not for released to that date servers.
> Dunno if that still applies to FB 2.1.x or 2.5.x

Actually, blobs are always sent in their defined characterset (which,
BTW, I believe Jaybird currently doesn't handle well when the blob
characterset differs from the connection characterset). The driver
should take care of conversion here, not Firebird.

>> Guessing for the user could result in 1) decreased performance because
>> it would need to connect twice
>
> which is very fast on FB
>
>> 2) incorrect behavior. You as the
>> developer / db administrator simply have to be explicit when
>> defining the connection.
>
> That is a gotcha. You should have foreknowledge or you're screwed.
> Why not make sensible defaults ?
>
> Person tries FB, it does not work, person ditches FB and goes to suggest against it on every forum.

I admit the documentation should be more explicit about it.
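
For reference, being explicit could look roughly like the sketch below. It
assumes Jaybird's "encoding" connection property (which takes a Firebird
characterset name); the host, database path and credentials are placeholders
made up for illustration, and the actual connect call is left as a comment.

```java
import java.util.Properties;

class ExplicitCharsetExample {
    // Placeholder URL: host, port and database path are illustrative only.
    static final String URL = "jdbc:firebirdsql://localhost:3050/C:/data/example.fdb";

    static Properties connectionProperties() {
        Properties props = new Properties();
        props.setProperty("user", "SYSDBA");        // placeholder credentials
        props.setProperty("password", "masterkey");
        // Name the connection characterset explicitly instead of relying on NONE.
        props.setProperty("encoding", "UTF8");
        return props;
    }

    public static void main(String[] args) {
        Properties props = connectionProperties();
        System.out.println("Would connect to " + URL
                + " with encoding=" + props.getProperty("encoding"));
        // With Jaybird on the classpath and a reachable server, something like:
        // try (java.sql.Connection c =
        //          java.sql.DriverManager.getConnection(URL, props)) { ... }
    }
}
```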

> Frankly, i can hardly think of situation where charset NONE would give correct behaviour and charset UTF8 would not.
> I think that Java developers would take pervasive unicode for granted.
>
> I believe it takes rather intimate knowledge of FB and its legacy to know that it should be enforced into unicode-aware connection.

The problem is that defaulting to UTF8 will not always work fine,
especially not if the database connected to is not UTF8.

However, maybe a more intelligent algorithm for deciding on the
characterset is possible (e.g. when NONE is used, try to use the
characterset of the database; if that is NONE as well, then use the
system encoding). I created http://tracker.firebirdsql.org/browse/JDBC-257
to look into this.
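
To make the idea concrete, the fallback order could be sketched like this.
This is only an illustration of the decision logic, not Jaybird code: the
class and method names are hypothetical, and how the database default would
be obtained (and at what cost in roundtrips) is left open.

```java
import java.nio.charset.Charset;
import java.util.Optional;

class ConnectionCharsetChooser {
    // Treat a missing, blank or "NONE" value as "nothing specified".
    static boolean isNone(String name) {
        return name == null || name.trim().isEmpty()
                || name.trim().equalsIgnoreCase("NONE");
    }

    /**
     * Returns the Firebird characterset name to connect with, or empty when
     * neither the user nor the database specifies one, in which case the
     * caller would fall back to the system encoding.
     */
    static Optional<String> choose(String requested, String databaseDefault) {
        if (!isNone(requested)) {
            return Optional.of(requested.trim());       // user was explicit: obey
        }
        if (!isNone(databaseDefault)) {
            return Optional.of(databaseDefault.trim()); // follow the database
        }
        return Optional.empty();                        // both NONE
    }

    public static void main(String[] args) {
        System.out.println(choose("WIN1251", "UTF8"));
        System.out.println(choose(null, "UTF8"));
        System.out.println(choose("NONE", "NONE")
                .orElse("system encoding: " + Charset.defaultCharset()));
    }
}
```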

<snip>

> BTW, what is that overhead if database and connection both are UTF-8 ?
>
> Are u talking that some characters have up to 4 bytes in UTF 8 so datastreams tend to inflate ? Or that there is check that each character is valid UTF8 character ? or whatever ?

Part of it is the current implementation in Jaybird, but the overhead
can be as much as 4x the declared length of a VARCHAR or CHAR, even when
sending only one character (see
http://tracker.firebirdsql.org/browse/JDBC-237 )
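
As a rough illustration of that factor: Firebird's UTF8 characterset allows
up to 4 bytes per character, so the worst-case size of a field is 4 times its
declared length, whatever the actual value needs. The sketch below is just
the arithmetic; it does not show how Jaybird allocates or sends buffers, and
the class name is made up.

```java
import java.nio.charset.StandardCharsets;

class Utf8VarcharOverhead {
    // Firebird's UTF8 characterset uses at most 4 bytes per character.
    static final int MAX_BYTES_PER_CHAR = 4;

    // Worst-case byte size of a CHAR/VARCHAR with the given declared length.
    static int worstCaseBytes(int declaredLength) {
        return declaredLength * MAX_BYTES_PER_CHAR;
    }

    // Bytes the value itself needs when encoded as UTF-8.
    static int actualBytes(String value) {
        return value.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int declared = 100; // e.g. VARCHAR(100) CHARACTER SET UTF8
        System.out.println("worst case: " + worstCaseBytes(declared) + " bytes");
        System.out.println("value \"a\": " + actualBytes("a") + " byte(s)");
    }
}
```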

> And what is character set used by Java applications themselves ? UCS-2 ? or UTF-8 ? or... ?

UCS-2 is the fixed-width predecessor of UTF-16. Internally Java uses
UTF-16, but in general UTF-8 or the local system encoding is used for
communication (it really depends a lot on the application).
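
A small sketch of what that means in practice: a String counts UTF-16 code
units, and the bytes only come into existence when you encode it with some
charset. The class and method names here are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class JavaStringEncodingDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points (characters outside the BMP count once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes once the String is encoded with the given charset.
    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String cyrillic = "\u0436";           // Cyrillic letter zhe, inside the BMP
        String supplementary = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair
        for (String s : new String[] { cyrillic, supplementary }) {
            System.out.println(utf16Units(s) + " UTF-16 unit(s), "
                    + codePoints(s) + " code point(s), "
                    + encodedLength(s, StandardCharsets.UTF_8) + " byte(s) in UTF-8");
        }
    }
}
```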

> I just try to imagine which misunderstanding could happen.
>
> TS -> Spring/Hibernate -> JDBC -> Jaybird -> Firebird
>
> At each boundary there are expectations what charset is used to encode binary streams, if expectations mismatch things are broken.

All those boundaries simply use Strings and don't concern themselves
with binary streams or anything, except between Jaybird and Firebird.

> But i think that all arrows, except for last, use JRE internal charset, so how could NONE damage anything ?

The problem is, as I explained in my other e-mail: if the database is
UTF-8, the connection characterset is NONE and the local system encoding
is WINDOWS-1251, then Jaybird can send byte combinations which are not
valid UTF-8 and which therefore cause transliteration errors.
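
The byte-level mismatch can be reproduced without any database. The sketch
below only shows that Cyrillic text encoded as windows-1251 is not
well-formed UTF-8; it does not involve Jaybird or Firebird, and the class
name is made up. It assumes the JRE provides the windows-1251 charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class TransliterationDemo {
    // Strictly decode the bytes as UTF-8; malformed input is reported, not replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "\u041F\u0440\u0438\u0432\u0435\u0442"; // Russian "Privet"
        byte[] asCp1251 = text.getBytes(Charset.forName("windows-1251"));
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        // The same String yields different bytes depending on the encoding used.
        System.out.println("windows-1251 bytes valid as UTF-8: " + isValidUtf8(asCp1251));
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(asUtf8));
    }
}
```

Pure ASCII text would pass either way, which is why this kind of problem
often only shows up once non-ASCII data is stored.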

Mark
--
Mark Rotteveel