Subject Re: [ib-support] FB 1.0 and GBAK problems
Author Claudio Valderrama C.
"Daniel Rail" <daniel@...> wrote in message
> Here's a twist to the case. When using unicode no problem occurs. When
> using normal ASCII for the character set, I get the arithmetic overflow
> error immediately. I just checked my database, where there was no error,
> got a surprise after performing the transformation from varchar(18)
> to varchar(6) unicode.

At least you can't shrink the field directly. The engine complains. There's
a misleading message about characters instead of bytes (maybe because I
reused a message that already existed) but the length is checked:
SQL> create table t_unic(a varchar(18) character set unicode_fss);

SQL> alter table t_unic alter column a type varchar(6);
Statement failed, SQLCODE = -607
unsuccessful metadata update
-New size specified for column A must be at least 54 characters.
=> this should be 54 bytes.

SQL> alter table t_unic alter column a type varchar(6) character set
Statement failed, SQLCODE = -607
unsuccessful metadata update
-New size specified for column A must be at least 54 characters.

SQL> alter table t_unic alter column a type varchar(17);
Statement failed, SQLCODE = -607
unsuccessful metadata update
-New size specified for column A must be at least 54 characters.

SQL> alter table t_unic alter column a type varchar(18);
=> no need to put the charset. The charset is preserved.

> The field is defined as "varchar(6) character set
> unicode_fss" and the value in the field is "0101014876". In my book, this
> shouldn't happen. Definitely, when I tried copying the value from this
> field to a field varchar(6) character set ascii, the arithmetic overflow
> error popped up.

Let's clear a difference: Windows NT uses unicode, the same than it's
defined in C as wchar_t, two bytes per character, always. FB uses
UNICODE_FSS. In this encoding, a character may use one, two or three bytes,
depending on its numeric value. The idea was the most of the time, the
character will fit into one byte, hence it's cheaper than using always
2-byte per char (NT) or 4-byte per char (Linux) for unicode. Hence, it's
possible that you squeeze more characters than the logical limit (measured
in characters, not bytes). For example
f char(3) character set unicode_fss;
uses 3 characters but 9 bytes. Depending on the characters, you can put up
to 9 characters. In the worst case, you can only put 3. I've talked to the
implementor (Dave) and he says that it would be possible to carry the number
of bytes per character internally, so the engine doesn't accept more than
the logical limit, even if there's room for more. At that point, my
uninformed opinion is that this would be good for charsets that use a fixed
char/byte ratio, but even when it's good for avoiding users mistakes, it
defeats the idea of unicode_fss, since you are not compressing anything. We
could be using instead fixed 3 byte/char in unicode, no unicode encoding.
IMHO, of course. Of course, if you put 5 single-byte characters into that
field "f" shown above, when trying to send it to an ASCII field with 3
positions available, you will get a string truncation error.

> Is there somewhere in Firebird's code that might be letting this go
> by?

I think it's the core design under unicode_fss. This isn't a proprietary
extension of IB/FB, of course, but a standard for encoding characters.

> And, why simply renaming the field name gives an error when doing the
> backup,

I've forgotten the multiple bug fixes that I did to metadata modification
module. I'm not implying that they are hundreds, but that they were rather
obscure paths of execution on a more obscure even source code, so I suspect
there're still cases that aren't covered. For now and unlike IB, in FB:
- you can't shrink a string field by mistake by toggling between char and
- you preserve the charset, so if the new size is greater or equal than the
current size, you succeed.

Then I think that simply renaming the field has followed just another buggy
path inside the engine. Have you been in a swamp? How many mosquitoes bite
you in a minute there?

> but if I copy to a new field, with the same definition, I don't get
> the error.

No, because charset to same charset is fast copy and even if it was done
with all the charset rules, the same characters fit nicely in the new field
due to unicode_fss rules.

> The odd thing is that with the test case that i created, I
> never got an error, even by following my original steps.

This means the bug is enough smart to hide itself. A non-reproducible test
case is like no case at all... or worse, because you have an user swearing
that there's a bug she cannot demonstrate.

> I did perform a simple test to show this.
> 1. I created a new database.
> 2. I created a table with 1 field(field1) of type "varchar(18) character
> set unicode_fss".
> 3. I inserted a few records, making sure that some of the values were over
> 6 characters long.
> 4. I then proceeded to created a temp_field of type "varchar(6) character
> set unicode_fss".
> 5. Copy data from field1 to temp_field: update table1 set
> No errors occurred.

Raw copy. Enough space to contain the source field. Probably all characters
you inserted are using one or two bytes each one, no more.

> 6. Execute: select * from table1
> The resulting data in field1 and temp_field are the same.

And it will work nicely.

> 7. Create a second temp_field2 of type "varchar(6) character set ascii"
> 8. Copy data from field1 to temp_field2: update table1 set
> Arithmetic overflow error occurs.

No, because you created some entries with more than 6 bytes, they cannot fit
in an ASCII field that effectively is 6 characters and 6 bytes long.

> Is there an explanation why it is happening this way?

Explained in previous paragraphs.

> Is it as
> designed?

Unfortunately, yes.

> I know that unicode_fss has 3 bytes per character, but looking
> at the results, it's almost as if it's the field length that is used for
> comparison, instead of the field's character length.

Three bytes per character in the worst case. Unicode_fss uses the same idea
than MBCS support in Delphi: some characters are single characters, others
have a prefix and use two bytes, but unicode_fss goes up to 3 bytes. Of
course, it's much simpler to use widechar or wchar_t even if you waste
space. I think that FSS/MBCS compression was thought totally biased toward
western languages, were you most likely will save storage space with such
techniques, because most characters will fit in one byte.

To be able to work in comparisons and transformations, the engine converts
internally to unsigned short, that's the custom equivalent of wchar_t in C.

Claudio Valderrama C. - -
Independent developer
Owner of the Interbase® WebRing