Subject Re: [Firebird-Architect] The Wolf on Firebird 3
Author Jim Starkey
Claudio Valderrama C. wrote:

>>-----Original Message-----
>>From: Firebird-Architect@yahoogroups.com
>>[mailto:Firebird-Architect@yahoogroups.com]On Behalf Of Jim Starkey
>>Sent: Monday, 07 November 2005 23:01
>>
>>
>>I thought we had covered that. Everything across the interface is
>>UTF-8.
>>
>>
>
>Then, is UTF-8 considered char* or UCHAR*, does it need special care with
>the 8th bit?
>
>
>
In the FbDbc interface, all character data is UTF-8. UTF-8 is an encoding
built from 8 bit units; byte values with the high bit clear are plain
ASCII. There are plenty of sites on the Web that lay out the structure of
UTF-8. When all is said and done, however, UTF-8 is a mechanical mapping
to and from Unicode code points, each of which fits in a 32 bit value.
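
To make the "mechanical mapping" concrete, here is a minimal sketch of decoding one UTF-8 sequence into a 32 bit code point. The name decodeUtf8 and its signature are hypothetical, invented for illustration; nothing here is part of FbDbc. It follows the standard bit layout but, to stay short, it does not reject overlong encodings or surrogate values, so treat it as an illustration rather than a validator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an FbDbc function): decode the UTF-8 sequence that
// starts at s into a 32 bit code point. Returns the number of bytes consumed,
// or 0 if the sequence is truncated or structurally malformed.
// Simplification: overlong forms and surrogates are NOT rejected.
std::size_t decodeUtf8(const char* s, std::size_t len, std::uint32_t* codePoint)
{
    assert(s != nullptr && codePoint != nullptr);
    if (len == 0)
        return 0;

    // Look at the byte as an unsigned value so the bit tests below behave
    // the same whether plain char is signed or unsigned on this platform.
    const unsigned char lead = static_cast<unsigned char>(s[0]);

    std::size_t count;      // total bytes in this sequence
    std::uint32_t value;    // payload bits collected so far

    if (lead < 0x80)                { count = 1; value = lead; }         // 0xxxxxxx (ASCII)
    else if ((lead & 0xE0) == 0xC0) { count = 2; value = lead & 0x1F; }  // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { count = 3; value = lead & 0x0F; }  // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { count = 4; value = lead & 0x07; }  // 11110xxx
    else
        return 0;           // stray continuation byte or invalid lead byte

    if (count > len)
        return 0;           // sequence runs past the end of the buffer

    // Each continuation byte is 10xxxxxx and contributes six payload bits.
    for (std::size_t i = 1; i < count; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
            return 0;
        value = (value << 6) | (c & 0x3F);
    }

    *codePoint = value;
    return count;
}
```

Called on the two bytes 0xC3 0xA9, this yields code point 0xE9 and reports two bytes consumed; an ASCII byte comes back unchanged with a length of one.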

The eighth bit doesn't need any more special care than the previous
seven. All eight are significant.

However, it is char*, since, after all, it represents characters. In C
and C++, plain char is a distinct type from both signed char and unsigned
char: it is a representation for a character, not a small integer. Given
the way computers work, every implementation has to decide whether char
behaves as a signed byte or an unsigned byte. Most, but not all, pick
signed. But it's arbitrary. The mappings from char* data to sequences of
glyphs are character sets, and the orderings of glyphs are collations. By
near universal convention, char* is a pointer to characters and signed
char* and unsigned char* are pointers to binary, non-character data.
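
A small sketch of what this means in practice, under the assumption that a caller occasionally wants to look at individual bytes of a UTF-8 buffer: the data is passed around as char*, and the only place signedness matters is where a byte is treated as a number, in which case it is viewed as unsigned char for that comparison. The names plainCharIsSigned and countNonAscii are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Whether plain char is signed is implementation-defined; this constant
// simply reports what the current compiler and target chose.
constexpr bool plainCharIsSigned = std::numeric_limits<char>::is_signed;

// Hypothetical helper: count the bytes in a UTF-8 buffer that have the
// eighth bit set, i.e. the bytes that belong to multi-byte sequences.
std::size_t countNonAscii(const char* text, std::size_t len)
{
    assert(text != nullptr || len == 0);

    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        // Comparing text[i] >= 0x80 directly would always be false where
        // char is signed, so view the byte as unsigned for the comparison.
        const unsigned char byte = static_cast<unsigned char>(text[i]);
        if (byte >= 0x80)
            ++n;
    }
    return n;
}
```

The interface type stays char* throughout; the cast is local to the one spot where a character is inspected as a byte value, which gives the same answer on signed-char and unsigned-char platforms.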