firebird-java - Re: [Firebird-Java] Connection character set information

Subject	Re: [Firebird-Java] Connection character set information
Author	Mark Rotteveel
Post date	2017-03-11T16:47:19Z

On 11-3-2017 16:50, 'William L. Thomson Jr.' wlt-ml@...
[Firebird-Java] wrote:

>> Jaybird doesn't have a connection property lc_type, so observations that
>> lc_type=NONE don't work can be explained by that. You need to use
>> lc_ctype (or its alias encoding). Non-existent connection properties are
>> ignored. In Jaybird 2.2 and earlier lc_type=NONE might seem to work,
>> because that was the default for lc_ctype anyway.
>
> That was a typo, I mean lc_ctype. For some reason I had some inconsistency
> with that.

Maybe you should switch to its alias encoding instead, less chance of
typos ;)

>> Also be aware that using charSet=ISO-8859-1 is equivalent to using
>> lc_ctype=ISO8859-1 (or lc_ctype=ISO8859-1&charSet=ISO-8859-1); it is not
>> the same as using lc_ctype=NONE (nor using
>> lc_ctype=NONE&charSet=ISO-8859-1), unless all the columns in your
>> database are NONE or ISO-8859-1.
>
> I likely should not have been using NONE. That was just what Jaybird 2
> defaulted to so I was experimenting with that. Since the DBs are ISO-8859-1,
> just using that.
>
> Though may switch to UTF-8. I avoided it due to bloating the db without any
> change in data. But that is seeming to happen with FB 3.x as is, so not sure
> switching to UTF-8 will bloat it any more.

I'm not sure what you mean by bloating in the case of FB3 (it should
behave similar to FB2.5, otherwise you might have hit a bug). One of the
main drawbacks of UTF8 in Firebird is that it reduces the max number of
characters you can store in a CHAR or VARCHAR by a factor 4 (eg only
8191 instead of 32764), and it could cause issues with the maximum
length of the parameter or value blocks in the protocol, if the total
(maximum) length of parameters (or result columns) exceeds 64K bytes
(where blobs count as 8 bytes).

>> In Jaybird 2.2:
>> * lc_ctype=NONE means use connection encoding NONE and interpret
>> everything using the default JVM encoding (eg Cp1252 on Windows in
>> Western Europe (+ US?), UTF-8 on most Linux environments)
>> * lc_ctype=NONE&charSet=ISO-8859-1 the same, but instead of the JVM
>> default, use ISO-8859-1
>
> I was likely not using character sets correctly at all. Though since not
> working with special characters, and most of ISO-8859-1 maps over to UTF-8,
> though not so much in reverse. I never had issues with older versions of
> Jaybird.

Yes, ISO-8859-1 defines only 191 characters, while - according to
Wikipedia - UTF-8 encodes 1,112,064 codepoints ('characters'). In short
all of ISO-8859-1 maps to UTF-8.

Purely looking byte-wise, only bytes 0-127 map from ISO-8859-1 directly
into UTF-8 (ie the ASCII range), if you try to decode ISO-8859-1 bytes
128-255 as UTF-8, you'll end up with question marks in Java (or a
transliteration error in Firebird). If you encode ISO-8859-1 characters
128-255* to UTF-8, they will have a 2-byte encoding per character.

*: I'm ignoring the fact that ISO-8859-1 does not define characters for
some values in that range.

>> In Jaybird 3.0:
>> * lc_ctype=NONE means connection encoding NONE and interpret columns
>> with character set NONE using the default JVM encoding, and interpret
>> columns with an explicit character set in their equivalent Java
>> character set
>> * lc_ctype=NONE&charSet=ISO-8859-1 the same, but instead of the JVM
>> default, use ISO-8859-1
>>
>> In both versions:
>> * lc_ctype=<firebird charset>* use connection encoding <firebird
>> charset> and interpret in the equivalent Java character set
>> * charSet=<java charset> use Firebird equivalent of <java charset> as
>> connection encoding and interpret in <java charset>
>> * lc_ctype=<firebird charset>*&charSet=<java charset> use connection
>> encoding <firebird charset>, but interpret in <java charset>
>
> Thanks that helps. Not sure I have seen it broken down like that before. Not
> sure I need both lc_ctype and charSet. Mostly just needed to set something for
> Jaybird 3. Due to requiring that now.
>
> But I guess to be safe, one should set both.

No, it is better to pick one. Specifying both lc_ctype and charSet is
risky if you use mismatched values (say lc_ctype=WIN1250 and
charSet=Cp1252 instead of Cp1250). The possibility to specify both is an
oddity that allows you to 'translate' between character sets if they
happen to have been stored incorrectly. It is a feature that I almost
broke early in Jaybird 3 because I initially thought it was a bug that
you could have mismatched values for character sets other than NONE.

Mark
--
Mark Rotteveel