firebird-architect - RE: [Firebird-Architect] The Wolf on Firebird 3

Subject	RE: [Firebird-Architect] The Wolf on Firebird 3
Author	Claudio Valderrama C.
Post date	2005-11-08T06:56:05Z

> -----Original Message-----
> From: Firebird-Architect@yahoogroups.com
> [mailto:Firebird-Architect@yahoogroups.com]On Behalf Of Jim Starkey
> Sent: Tuesday, November 08, 2005 0:02
>
> However, it is char*, since, after all, it represents characters. In C
> and C++, char is neither signed nor unsigned, but a representation for a
> character. Given the way computers work, every implementation has to
> decide whether char is a signed byte or an unsigned byte. Most, but not
> all, pick signed. But it's arbitrary.

Yes, MS picks signed and Borland uses unsigned, even on the same HW platform.
Adriano had problems with some ctype.h functions in isql and solved them
by casting the chars to UCHAR before passing them to isspace(). That was
the specific troublemaker, but you're likely to hit the same issue with any
function in that library.
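A minimal sketch of that kind of fix, assuming a hypothetical wrapper named is_space_safe (this is not the actual isql code). The ctype functions expect an int that is either EOF or representable as unsigned char, so a negative plain char has to be converted first:

```cpp
#include <cctype>

// Hypothetical helper for illustration; not taken from isql.
// Passing a negative plain char straight to isspace() is undefined
// behaviour, so the value is converted to unsigned char first.
inline bool is_space_safe(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

The same cast applies to the other classification functions in that header (isalpha(), isdigit(), toupper() and so on).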

Or the problem Dmitry Sibiryakov solved:

int getNextInputChar()
{
    ...
    // readline found EOF
    if (lastInputLine == NULL) {
        return EOF;
    }
    ...
    // cast to unsigned char to prevent sign extension
    // this way we can distinguish Russian ya (0xFF) and EOF (usually (-1))
    return (unsigned char)lastInputLine[getColumn++];
}
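A self-contained sketch of the same idea, with hypothetical names (next_char and its parameters are stand-ins, not the real isql reader). It shows that a 0xFF byte and end of input stay distinguishable once the byte is widened through unsigned char:

```cpp
#include <cstddef>
#include <cstdio>   // EOF

// Hypothetical stand-in for the reader above: returns the next byte of
// the line as a non-negative int, or EOF when the input is exhausted.
int next_char(const char* line, std::size_t& pos, std::size_t len)
{
    if (line == nullptr || pos >= len)
        return EOF;
    // Widen through unsigned char so that 0xFF becomes 255, not -1.
    return static_cast<unsigned char>(line[pos++]);
}
```

Without the cast, a compiler with signed plain char would return -1 for the 0xFF byte, which the caller could not tell apart from EOF.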

> The mapping from char* data to
> sequences of glyphs are character sets, and the ordering of glyphs are
> collations. By near universal convention, char* is a pointer to
> characters and signed char* and unsigned char* are pointers to binary,
> non-character data.

This is taken from [Stroustrup 2000]:

BEGIN QUOTE

For example:
char c = 255; // 255 is "all ones," hexadecimal 0xFF
int i = c;

What will be the value of i? Unfortunately, the answer is undefined. On all
implementations I know of, the answer depends on the meaning of the "all
ones" char bit pattern when extended into an int. On a SGI Challenge
machine, a char is unsigned, so the answer is 255. On a Sun SPARC or an IBM
PC, where a char is signed, the answer is -1. In this case, the compiler
might warn about the conversion of the literal 255 to the char value -1.

END QUOTE
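The behaviour in the quote can be reproduced deterministically by spelling out the signedness. A small sketch, with hypothetical function names, that also reports which choice the compiler at hand made for plain char:

```cpp
#include <limits>

// True when this compiler treats plain char as a signed type.
constexpr bool plain_char_is_signed = std::numeric_limits<char>::is_signed;

// Widening a signed byte sign-extends: the "all ones" pattern gives -1.
int widen_signed(signed char c) { return c; }

// Widening an unsigned byte zero-extends: the same pattern gives 255.
int widen_unsigned(unsigned char c) { return c; }

// Plain char behaves like one of the two above, depending on the compiler.
int widen_plain(char c) { return c; }
```

On a two's complement machine, widen_plain() applied to the 0xFF byte yields -1 where char is signed and 255 where it is unsigned, matching the two cases in the quote.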

Now what does this have to do with my original question? Some programs may
work with MBCS and its special functions to decode characters that are
represented by a variable number of bytes. But other programs may
work in fixed-width Unicode (in C++ that would be wchar_t, which is two bytes
per character with some compilers, such as MSVC, and four with others) or in
four-byte Unicode to handle everything possible. Whenever you convert from
UTF-8 to a wide character, you may get nasty sign surprises if the bytes are
held in a signed char, so I think those UTF-8 interfaces are better served by
UCHAR than char.
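To illustrate the point, here is a deliberately limited sketch of a UTF-8 decoder that takes unsigned bytes, assuming UCHAR is a typedef for unsigned char. The function name and its error convention (returning U+FFFD) are assumptions made for the example, and it handles only one- and two-byte sequences:

```cpp
#include <cstddef>

// Illustrative subset: decodes a single one- or two-byte UTF-8 sequence.
// Returns the code point and stores the number of bytes consumed in `used`;
// anything else yields U+FFFD (REPLACEMENT CHARACTER).
char32_t decode_short_utf8(const unsigned char* s, std::size_t len,
                           std::size_t& used)
{
    used = 0;
    if (len == 0)
        return 0xFFFD;

    const unsigned char b0 = s[0];
    if (b0 < 0x80) {            // plain ASCII byte
        used = 1;
        return b0;
    }
    // 0xC2..0xDF introduce a two-byte sequence (0xC0 and 0xC1 would be
    // overlong encodings); the second byte must look like 10xxxxxx.
    if (b0 >= 0xC2 && b0 <= 0xDF && len >= 2 && (s[1] & 0xC0) == 0x80) {
        used = 2;
        return (static_cast<char32_t>(b0 & 0x1F) << 6) | (s[1] & 0x3F);
    }
    used = 1;                   // skip one byte of an unsupported sequence
    return 0xFFFD;
}
```

Because the bytes are unsigned, range tests such as b0 >= 0xC2 behave as written. With a signed char every byte above 0x7F is negative, so the same comparisons would never succeed.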

C.