Subject Re: [firebird-support] understanding characters sets
Author Helen Borrie
At 17:23 15/08/2008, you wrote:
>Helen Borrie wrote:
>
>> Strings are written to the database using the defined character set,
>> which will be either the default character set defined for the database
>> or, if present, the character set defined for the column they are
>> written to. For string data to be transliterated correctly for both
>> writing and reading, the connection character set must be the same as
>> the destination character set.
>
>I do not fully understand this.
>
>It sounds as if I have a column with e.g. UTF8, then I must connect with
>UTF8 for transliteration to work correctly.
>
>Huh?
>
>What if I have two columns, one with UTF8 and one with ISO 8859-1? Then
>I'd have to use two different connections to get both transliterated
>correctly - one connection for each character set.
>
>This can't be how it's supposed to work, can it?
>
>So, I must be missing something?
>
>Before reading your post, I thought that if I have a connection with
>e.g. UTF8, then all strings passed through that connection are assumed
>to be UTF8, and if that data is going into a column with a different
>character set, it will be transliterated to that character set.

That's correct. I can see how you read that other paragraph to mean what you took it to mean. But, to put it another way:

Say your DB character set is UTF8 and you are connected with UTF8. Your connection is busy transliterating input from <whatever comes in> to UTF8 for the client. Suddenly, it finds itself needing to transliterate to ISO8859-1 instead, because the data is headed for a column defined with that character set. No problem - it does exactly that.
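
To make that concrete, here is a rough sketch in Python. It is only an analogy: Python's codecs stand in for the engine's transliteration, and the function name and charset names are made up for the illustration, not anything Firebird exposes. The point is that the client sends everything in one character set (the connection's) and each column gets its own conversion.

```python
# Analogy only: Python codecs stand in for the server's transliteration.
# transliterate() is a hypothetical helper, not a Firebird API.
def transliterate(data: bytes, connection_charset: str, column_charset: str) -> bytes:
    """Interpret bytes in the connection charset, then store them in the column charset."""
    text = data.decode(connection_charset)   # what the client sent
    return text.encode(column_charset)       # what the column stores

# The client always sends UTF-8 because the connection charset is UTF-8.
client_bytes = "caf\u00e9".encode("utf-8")

# Same input, two destination columns with different character sets.
utf8_column = transliterate(client_bytes, "utf-8", "utf-8")
latin1_column = transliterate(client_bytes, "utf-8", "iso-8859-1")
```

The UTF-8 column keeps the two-byte form of the accented character and the ISO 8859-1 column gets the one-byte form. A character that has no representation in the column's character set would fail the conversion in this sketch.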

>(Excepting NONE and OCTETS).
>
>On the other hand, this also seems problematic. In markd_mms' situation,
>text pasted from web pages would probably often be in ISO 8859-1, but
>might also be UTF8. So, he'd need to be able to send both character sets
>to the DB in some way. In general, an application might need to use
>different character sets for input to different columns. If all the
>strings have to be sent in the connection's character set, that would be
>problematic.

No. It's a problem when the input coming from the client is not well-formed for *either* character set (the connection charset or the destination column's charset). For a simple example, say the OP copied website text that was ASCII with some 8-bit characters rendered from &xxx; entities in the HTML. The character stream that he pastes into waiting parameters in the client cannot possibly be well-formed, and malformed strings can't be transliterated. Data collected in this fashion off a web page is particularly prone to this. You can get a reasonable indication of whether the data you are copying is well-formed or is more like alphabet soup by inspecting what the rendered characters look like in the page source.
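
A client-side check along these lines can catch the alphabet-soup case before the data goes anywhere near the database. Again this is a Python sketch with a made-up helper name, and it assumes the pasted data is available as raw bytes. Note that in Python every byte sequence decodes as ISO 8859-1, so the check is only informative for a charset like UTF-8.

```python
# Hypothetical helper: does this byte string decode cleanly in the given charset?
def is_well_formed(data: bytes, charset: str) -> bool:
    try:
        data.decode(charset)      # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True

# Text pasted from an ISO 8859-1 page: a single 0xE9 byte for the accented e.
pasted_latin1 = b"caf\xe9 au lait"

# A mix of UTF-8 and ISO 8859-1 bytes in the same string.
pasted_mixed = b"caf\xc3\xa9 and caf\xe9"

latin1_as_utf8 = is_well_formed(pasted_latin1, "utf-8")
mixed_as_utf8 = is_well_formed(pasted_mixed, "utf-8")
clean_utf8 = is_well_formed("caf\u00e9".encode("utf-8"), "utf-8")
```

Neither pasted string is valid UTF-8, so sending them over a UTF8 connection would be sending malformed input. Only the last one, which really is UTF-8, passes.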

./heLen