Subject | Substring search problem observed with UTF8 / UNICODE_CI_AI
---|---
Author | |
Post date | 2013-10-19T02:05:33Z |
I have a database with approx. 10k records, which I just converted from charset/collation NONE to UTF8 with UNICODE_CI_AI.
Yes, I understand that substring searches are not optimizable, but with 10k records, searching a VARCHAR(80) field used to take milliseconds and use zero noticeable CPU, whereas it now takes approx. 4 seconds and consumes an entire CPU core until done.
This is a huge difference.
The following may be of interest to those with C/C++ knowledge:
I was reading about the different philosophies programmers use when handling UTF-8 internally (which in our case would be the actual memory allocated by the Firebird server when using the UTF-8 character set).
Some programmers allocate two bytes for every single character, which supposedly means at least double the work; internally, they are working with UTF-16 code units.
Alternatively, other programmers allocate one byte for ASCII characters (0-127) and two bytes for less commonly used characters such as Latin/European accented letters, while most Asian characters take three bytes each. This means that commonly used ASCII characters translate to less work for the system, while Asian characters are at a disadvantage.
Here is a good article that talks about this in further detail: http://www.utf8everywhere.org/