Subject: Re: [firebird-support] Re: Treat a VARCHAR OCTETS column as binary in Delphi
Author: Helen Borrie
Post date: 2005-02-14T02:34:45Z
At 12:52 PM 13/02/2005 +0000, you wrote:
> > Try using QuotedStr()
> >
> > ...field.AsString := QuotedStr('12 3B 00 AA');
> >
>Thanks Helen, but I need to store them in binary format to save space,
>those are actually UTF-16 encoding, which may contain null characters. I
>can store the null characters into the database, but I can't get the full
>data back, as they are treated as null-terminated string when I call
>AsString or GetData.

Null-terminated strings in ObjectPascal???????????
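
The point behind the question marks: ObjectPascal long strings carry an explicit length and are not null-terminated, so an embedded #0 does not, by itself, cut a string short. A minimal console sketch, assuming nothing beyond standard long strings:

program NullsInLongStrings;
{$APPTYPE CONSOLE}
var
  s: string;
begin
  { Long strings are length-counted, so #0 is just another character. }
  s := 'AB' + #0 + 'CD';
  Writeln(Length(s));    { 5 - the full length, including the #0 }
  Writeln(Pos(#0, s));   { 3 - the #0 can be located like any other char }
end.

So if the full data can't be read back, suspect the data-access layer rather than the string type itself.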
>Why do I store UTF-16 in VARCHAR OCTETS but not UNICODE_FSS? Coz I can't
>beat UNICODE_FSS, same issue with previous post titled "Firebird and
>Unicode queries". I can choose to store UTF-8 in OCTETS which do not
>contain null characters, but my data are mainly Asian characters, which
>take less space in UTF-16.
>
>By the way, I doubt that UNICODE_FSS is actually UTF-8, which store a-z as
>1 byte, and more bytes for others. Because I see only one byte for 'a' and
>two byte for 'ab' in Database Workbench's Blob Editor.

UNICODE_FSS is a UTF-8 superset. "FSS" is "FileSystem Safe" - fss-utf was
the precursor of what is now the UTF-8 standard. If you're interested,
there is a lengthy exposition here:
http://czyborra.com/utf/
(btw, "FSS" is not "Fixed Symbol Size" as I suggested last week!)
The Borland implementation of fss-utf is UNICODE_FSS, which stores all
properly interpreted characters -- including the US-ASCII equivalents --
with a fixed word length of 3 bytes.
If compression is a must, then I guess it's a valid approach to store your
UTF-xyz strings as OCTETS. Then you must presume nothing about how the db
engine and client library will treat these bytes when asked to retrieve
them, since OCTETS is nothing but a stream of binary codes with no value
associated with any character repertoire.
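
To make the storage direction concrete, a minimal sketch, assuming an invented table declared as in the comment below and making no assumptions about the access layer:

{ Assumes a column declared along these lines (names invented):
    CREATE TABLE docs (
      id   INTEGER NOT NULL PRIMARY KEY,
      body VARCHAR(8000) CHARACTER SET OCTETS);
  The engine will neither transliterate nor validate these bytes. }

function WideToOctets(const WS: WideString): AnsiString;
begin
  { Copy the UTF-16LE code units verbatim into a plain byte string,
    ready to be bound to the OCTETS parameter of whatever access
    layer is in use. }
  SetLength(Result, Length(WS) * SizeOf(WideChar));
  if WS <> '' then
    Move(WS[1], Result[1], Length(Result));
end;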
That approach presupposes at least the following:
--- that you are using a programming interface capable of interpreting what
these byte streams mean, in terms of the character repertoire required to
treat them as strings, display them meaningfully, and convert the characters
back into the appropriate word sequences for searching and storage.
Delphi out of the box has very lame support for MBCS. It can support
single-byte charsets with its String and AnsiString types (an array of
char) and MBCS of 1 or 2 bytes (WideString). If you are going to have your
OCTETS data interpreted by your application code as some MBCS of 2 or more
bytes, you are going to have to write that interpretation yourself, or use
some components that will help to do it for you, e.g. the TNT components.
--- that you are certain that DbExpress doesn't impose some length limit on
strings or otherwise truncate your data when AsString is called (see the
sketch just below, which covers both points).
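
A hedged sketch of both points, with invented names: RawFromDb stands for whatever byte string your access layer actually returns, ExpectedBytes for the byte length that was stored. Reinterpret the raw bytes as UTF-16 yourself, and check that nothing was cut short on the way:

{ Requires SysUtils for Exception. }
function OctetsToWide(const RawFromDb: AnsiString;
  ExpectedBytes: Integer): WideString;
begin
  { Second point: a shortfall here means the access layer imposed a
    length limit or stopped at an embedded #0. }
  if Length(RawFromDb) <> ExpectedBytes then
    raise Exception.CreateFmt('Expected %d bytes, received %d',
      [ExpectedBytes, Length(RawFromDb)]);
  { First point: the interpretation work - map the raw bytes back onto
    the UTF-16 code units they encode. }
  SetLength(Result, Length(RawFromDb) div SizeOf(WideChar));
  if Result <> '' then
    Move(RawFromDb[1], Result[1], Length(Result) * SizeOf(WideChar));
end;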
Back to UNICODE_FSS ---
"Properly interpreted" means that the engine knew, at the time of storage,
that the incoming data were to be interpreted as UNICODE_FSS. This would
be true if the client's lc_ctype were UNICODE_FSS *and* the blob column was
defined as sub_type 1 *and* with CHARACTER SET UNICODE_FSS. As a
variation, if the database default character set is not UNICODE_FSS (and
hence the lc_ctype UNICODE_FSS would be invalid), the blob could have
stored UNICODE_FSS if it was converted from a string that was cast as
UNICODE_FSS *and* the blob's charset was UNICODE_FSS. Absent these
conditions, the blob would be stored as just a stream of bytes with no
character set association.
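
Spelled out as a rough sketch, with invented table and parameter names and the SQL wrapped in Pascal constants, the two paths described above look something like this (check the CAST-with-character-set form against your server version):

const
  { A text blob: sub_type 1 plus an explicit character set on the column. }
  DDL_NOTES =
    'CREATE TABLE notes (' +
    '  id   INTEGER NOT NULL PRIMARY KEY,' +
    '  body BLOB SUB_TYPE 1 CHARACTER SET UNICODE_FSS)';

  { The variation: force the source string to UNICODE_FSS with a cast
    when the database default character set is something else. }
  INSERT_NOTES =
    'INSERT INTO notes (id, body) VALUES (:id,' +
    '  CAST(:body AS VARCHAR(4000) CHARACTER SET UNICODE_FSS))';

The first path also assumes the client connected with lc_ctype=UNICODE_FSS; exactly how that is passed (connection parameter, DPB entry) depends on the access layer.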
If you are seeing "a" as a one-byte symbol in a blob editor, then it's a
good bet that the text being described by the blob editor is not
UNICODE_FSS. The reason for that might be that the blob editor itself has
the capability of casting the blob's byte stream into the OEM charset for
the UI, as the text editor ConText (also written in Delphi) can do.
No doubt Martijn will enlighten....both on DBW and on DbExpress. Martijn
also markets a DbExpress driver for Firebird.
./heLen