Subject | Re: [firebird-support] Allowing Space for Unicode |
---|---|
Author | Helen Borrie |
Post date | 2009-01-29T19:51:24Z |
At 04:18 AM 30/01/2009, you wrote:
Charset NONE (alias ASCII7, USASCII) is a single-byte charset (SBCS). Varchar(128) allows up to 128 bytes for a SBCS. Fb supports some multi-byte charsets (MBCS) that are not unicode, e.g., SJIS_208 is a charset for encoding Japanese characters, 2 bytes for each character, so a varchar(128) will have 256 bytes capacity. UTF8 is a MBCS that can store up to 4 bytes per character, capacity 512 bytes for your varchar(128) in that charset.
Where you will have to watch out is with indexes. Their limits are determined by byte length, not number of characters. So, for example, in a database with an 8KB page size, an index key can be at most 2048 bytes. 2048/4 = 512 characters. But it doesn't stop there: non-binary collations (essential for unicode and for most non-ascii charsets) eat more bytes...lots more in most cases. Multi-segment keys eat even more. For example, the maximum size of an indexed varchar in UTF8 in a DB with an 8Kb page size is 40 characters!
Handy calculator for index sizes (thanks to Ivan Prenosil) here:
http://www.volny.cz/iprenosil/interbase/ip_ib_indexcalculator.htm
./hb
>Hi all,No "rules" needed. If varchar(128) is sufficient for English then it will be sufficient for the same data in other languages. That's because the numbers in char() and varchar() specify the number of characters, not the number of bytes.
>
>Have heard a rumor that America is not the only country in the world ;)
>
>Can anyone offer general rules for setting size attributes of VARCHAR fields?
>For instance, if I believe that Varchar(128) is sufficient for english, then
>Varchar(256) probably should be used to allow for unicode...
Charset NONE (alias ASCII7, USASCII) is a single-byte charset (SBCS). Varchar(128) allows up to 128 bytes for a SBCS. Fb supports some multi-byte charsets (MBCS) that are not unicode, e.g., SJIS_208 is a charset for encoding Japanese characters, 2 bytes for each character, so a varchar(128) will have 256 bytes capacity. UTF8 is a MBCS that can store up to 4 bytes per character, capacity 512 bytes for your varchar(128) in that charset.
Where you will have to watch out is with indexes. Their limits are determined by byte length, not number of characters. So, for example, in a database with an 8KB page size, an index key can be at most 2048 bytes. 2048/4 = 512 characters. But it doesn't stop there: non-binary collations (essential for unicode and for most non-ascii charsets) eat more bytes...lots more in most cases. Multi-segment keys eat even more. For example, the maximum size of an indexed varchar in UTF8 in a DB with an 8Kb page size is 40 characters!
Handy calculator for index sizes (thanks to Ivan Prenosil) here:
http://www.volny.cz/iprenosil/interbase/ip_ib_indexcalculator.htm
./hb