Subject | Re: understanding characters sets |
---|---|
Author | markd_mms |
Post date | 2008-08-16T04:50:23Z |
--- In firebird-support@yahoogroups.com, Helen Borrie <helebor@...> wrote:
> No. It's a problem when the input coming from the client is not
> well-formed for *either* character set (the connection charset or the
> destination column's charset). For a simple example, say the OP did a
> copy from website text that was ASCII with some 8-bit character images
> rendered from &xxx; elements in the HTML. The character stream that
> he pastes into waiting parameters in the client cannot possibly be
> well-formed and malformed strings can't be transliterated. Data being
> collected in this fashion off a web page will be particularly prone to
> this. You can get a reasonable indication of whether the data you are
> copying in this fashion is well-formed or is more like alphabet soup
> by inspecting what the rendered characters look like in the page
> source....

If I view the source of the web page in Firefox, it displays
exactly as it does when I copy and paste the text into Notepad - it
isn't a special &xxx; character. If I look at
http://en.wikipedia.org/wiki/Iso_8859-1, 0x92, 0x93 and 0x94 are
missing, which I assume means they have no value in that character set.
If I look at http://www.unicode.org/charts/PDF/U0080.pdf, 0x92 and 0x93
are reserved.
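
One quick way to see what those byte values stand for is to decode them under more than one character set and compare. The following is only a minimal Python sketch, assuming the pasted text originated as Windows-1252 (the thread does not confirm the page's encoding). The names `SUSPECT_BYTES` and `describe` are made up for illustration.

```python
import unicodedata
from typing import List

# The three byte values from the post, as they might arrive in pasted text.
SUSPECT_BYTES = bytes([0x92, 0x93, 0x94])


def describe(raw: bytes, encoding: str) -> List[str]:
    """Decode raw with the given encoding and label each resulting character."""
    labels = []
    for ch in raw.decode(encoding):
        # Control characters have no Unicode name, so supply a fallback.
        name = unicodedata.name(ch, "<no name: control character>")
        labels.append(f"U+{ord(ch):04X} {name}")
    return labels


if __name__ == "__main__":
    for encoding in ("cp1252", "latin-1"):
        print(encoding)
        for label in describe(SUSPECT_BYTES, encoding):
            print("   ", label)
```

Under `cp1252` the three bytes come out as curly quotation marks (U+2019, U+201C, U+201D). Under Python's `latin-1` codec they land on the C1 control code points U+0092 to U+0094, which have no printable glyph. That would be consistent with the characters looking normal in Notepad while appearing to be absent from the ISO 8859-1 table.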

Considering this, and that I'm not going to be able to convince people
to edit any copied text before saving, should I just specify the
character set of the synopsis column as NONE and hope that whichever
client program displays it can interpret it correctly?
Mark