Subject Re: [firebird-support] getByte('UTF-8').length
Author Mark Rotteveel
On 15-5-2014 17:49, Łukasz Bączek baczeklu@... [firebird-support] wrote:
>> Warsaw (without any non-ASCII) will be 6 bytes.
>
> Try the "Warszawa" is 7 bytes.

Warszawa is 8 characters, so would at minimum require a VARCHAR(8). And
as they are all in ASCII, with UTF-8 that is 8 bytes, however to store
UTF8 internally, Firebird has to reserve 4*8 = 32 bytes (because
Firebird has to assume the worst case byte requirement), however that
should be entirely transparent to you.

>> > I have a field in a table (base Firebird)
>> > declared in this way
>> >
>> > "KOD" VARCHAR (6)
>>
> "KOD" VARCHAR (6) CHARACTER SET UTF8 COLLATE UNICODE_CI_AI
>
>> What is the character set of the field (or the default character set of
>> your database)?
>>
>> > IBExpert multiplies the size of the VARCHAR * 4
>>
>> Could you describe this in more detail. I don't use IBExpert and I have
>> no idea what you mean with this.

> http://ibexpert.net/ibe/index.php?n=Doc.ConvertYourFirebirdApplicationsToUnicode
>
> The importance of field size
>
> The extra storage required by different character sets needs to be taken
> into consideration when specifying field size. For example, when you are
> working in a non-Unicode database:
>
> CREATE TABLE TEST5 (TXT VARCHAR(32000))
>
> is not a problem, even though it’s not necessarily a good idea to define
> your VARCHAR(32000) that big. But you may have a problem in the Unicode
> database, because Firebird has to consider that the information that is
> stored here is could be, for example, a complete list of Chinese
> characters. As every single character needs up to 4 bytes, it must have
> the room to store up to 32000*4. So the actual maximum VARCHAR length
> with UTF8 is 8191 and not 32000.

...

This doesn't mean that "IBExpert multiplies the size of the VARCHAR *
4". It means that the storage of a single UTF-8 character in Firebird
takes up to 4 bytes. Fields in Firebird can be max 32768 bytes each (and
a VARCHAR takes 2 bytes for the length, so a VARCHAR can be max 32766
bytes). As UTF-8 needs a max of 4 bytes per character this leads to a
maximum field length of 8192 for CHAR and 8191 for VARCHAR.

From your original post:

> So I am also downloading this value multiplied by four what would have
> gave it (24 bytes), but my thinking here is probably incorrect and
> should not multiply this value by 4 because Firebird me screams that
> does not fit.

You yourself don't need to do any multiplying (or division): that is the
job of the driver (unless you purposefully use NONE everywhere).

> and here a question for you which value should be taken into account
> because it multiplied like so it means that something kicked while
> getting the size in bytes of UTF8?

I still don't really understand your problem. You have a VARCHAR(6)
field and you try to fit 8 characters into it: Firebird is right to
complain about that.

My impression is that you thank that as a VARCHAR(6) CHARACTER SET UTF8
takes up to 6 * 4 = 24 bytes, that you will be able to fit 24 ASCII
characters in it: that isn't the case.

Mark
--
Mark Rotteveel