firebird-support - Re: [firebird-support] getByte('UTF-8').length

Subject	Re: [firebird-support] getByte('UTF-8').length
Author	Łukasz Bączek
Post date	2014-05-15T15:49:03Z

W dniu 15.05.2014 17:18, Mark Rotteveel mark@... [firebird-support] pisze:

On 15-5-2014 17:02, Łukasz Bączek baczeklu@... [firebird-support] wrote:
> Hello
> I write in Java one thing and needs to check if the data word, eg
> "Warszawa" includes storage space encoded as UTF8

I am not sure what you mean with "data word".

> I started googling for this and found something like this
>
> String s = "Warsaw"
> s.getByte ('UTF-8'). length - returns me probably 7 bytes

Warsaw (without any non-ASCII) will be 6 bytes.

Try the "Warszawa" is 7 bytes.

> I have a field in a table (base Firebird)
> declared in this way
>
> "KOD" VARCHAR (6)

"KOD" VARCHAR (6) CHARACTER SET UTF8 COLLATE UNICODE_CI_AI

What is the character set of the field (or the default character set of
your database)?

> IBExpert multiplies the size of the VARCHAR * 4

Could you describe this in more detail. I don't use IBExpert and I have
no idea what you mean with this.

http://ibexpert.net/ibe/index.php?n=Doc.ConvertYourFirebirdApplicationsToUnicode

The importance of field size

The extra storage required by different character sets needs to be taken into consideration when specifying field size. For example, when you are working in a non-Unicode database:

CREATE TABLE TEST5 (TXT VARCHAR(32000))

is not a problem, even though it’s not necessarily a good idea to define your VARCHAR(32000) that big. But you may have a problem in the Unicode database, because Firebird has to consider that the information that is stored here is could be, for example, a complete list of Chinese characters. As every single character needs up to 4 bytes, it must have the room to store up to 32000*4. So the actual maximum VARCHAR length with UTF8 is 8191 and not 32000.

This also needs to be taken into consideration when you are using variables; you have to think about the same issue. You should always take into consideration that variables might be subject to internal limitation.

And simply replacing your large VARCHARs with blobs is not necessarily the solution. Every time you change it, the versions are stored in the memory which can have a huge negative impact on server performance. So if you do work with variables that are blobs, do not manipulate them too often. Or alternatively work on VARCHAR variables and combine them in the last step to a single blob.

> So I am also downloading this value multiplied by four what would have
> gave it (24 bytes), but my thinking here is probably incorrect and
> should not multiply this value by 4 because Firebird me screams that
> does not fit.
>
> and here a question for you which value should be taken into account
> because it multiplied like so it means that something kicked while
> getting the size in bytes of UTF8?

Again I am not entirely certain what you mean with this. It sounds as if
you are using connection character set NONE, while the database field is
UTF8.

The storage of a UTF8 field in Firebird is 4x the character length as
that is the maximum needed. Some tools will display the field size based
on (storage size) / (bytes per character), but when the connection
character set is NONE they don't always look at the character set of the
column itself and use (bytes per character) = 1.

Mark
--
Mark Rotteveel