Subject | Re: Problems using LIKE with UNICODE |
---|---|
Author | peter_jacobi.rm |
Post date | 2003-10-18T13:26:42Z |
Hi Pavel,
"pavel_menshchikov" <mpn2001@y...> wrote:
[...]
> SET NAMES UNICODE_FSS;[...]
> DEFAULT CHARACTER SET UNICODE_FSS;[...]
> SELECT * FROM ROOMS WHERE NAME LIKE 'string_in_russian%'[...]
> SELECT * FROM ROOMS WHERE UPPER(NAME) LIKE UPPER('STRING_IN_RUSSIAN%')

Unfortunately, you are on the hurting side of two FB bugs here.
In addition, there may be a misunderstanding about the use of
'SET NAMES'.
0. SET NAMES
Are you using ISQL under Win32? If not, please specify.
Assuming you are using ISQL under Win32, it is important to
note that 'SET NAMES' only tells the FB server how the bytes in
character strings should be interpreted. So
SET NAMES UNICODE_FSS tells the FB server that you are sending
UTF-8 character strings. But your Win32 console most likely is
not using UTF-8 as its input and output codepage.
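To see the mismatch outside of FB, here is a small Python sketch.
It assumes the console codepage is 1251 (yours may differ, an OEM
codepage is also possible) and only mimics a receiver that has been
told the bytes are UTF-8. The function names are made up for this
example and have nothing to do with the server's internals.

```python
# Assumption: the console encodes typed text with code page 1251.
CONSOLE_CODEPAGE = "cp1251"


def bytes_from_console(text: str) -> bytes:
    """Bytes the console would hand over for the typed text."""
    return text.encode(CONSOLE_CODEPAGE)


def server_reading(raw: bytes) -> str:
    """How a receiver that was told 'this is UTF-8' reads those bytes."""
    # Invalid sequences become U+FFFD instead of raising an error.
    return raw.decode("utf-8", errors="replace")


typed = "комната"
raw = bytes_from_console(typed)

print(raw.hex())                     # one byte per Cyrillic letter
print(typed.encode("utf-8").hex())   # two bytes per letter in real UTF-8
print(server_reading(raw) == typed)  # False: the text did not survive
```

The same seven letters are 7 bytes in codepage 1251 and 14 bytes in
UTF-8, so the two sides do not even agree on the string length.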
You can theoretically set the console to UTF-8 mode using
chcp 65001, but I didn't have much fun with this setting.
You can work around the problem by preceding all string
constants with a character set introducer stating the console's
codepage, as in _WIN1251'some Russian text'. The output side gets
even uglier: you must use a double cast, first to the
console codepage, then to CHARSET NONE.
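Conceptually, the introducer and the double cast describe one
re-encoding step on each side of the connection. A rough Python
analogue, again assuming codepage 1251 for the console; transcode
is a hypothetical helper for illustration, not an FB function:

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one character set to another."""
    return raw.decode(source).encode(target)


# Input side: the console sends codepage 1251 bytes, the database
# column stores UTF-8.
console_bytes = "комната".encode("cp1251")
stored_bytes = transcode(console_bytes, "cp1251", "utf-8")

# Output side: convert back to the console codepage and then emit the
# bytes untouched (the role CHARSET NONE plays in the double cast).
shown_bytes = transcode(stored_bytes, "utf-8", "cp1251")

print(stored_bytes.hex())
print(shown_bytes == console_bytes)  # True
```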
The third possibility, where SET NAMES UNICODE_FSS works best,
but in batch mode only:
a) Prepare all statements in a UTF-8 text file.
b) Run ISQL with input and output redirection.
c) View the output text file in a UTF-8 capable editor.
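Steps a) to c) could be scripted roughly as follows. This is only a
sketch: the helper names are invented here, and writing the file
without a byte order mark is my assumption about what ISQL expects.
The command to run is passed in by the caller, so nothing in the
block depends on where ISQL is installed.

```python
import subprocess
from pathlib import Path
from typing import Sequence


def write_utf8_script(path: Path, statements: Sequence[str]) -> None:
    """Step a): store the statements as UTF-8, without a byte order mark."""
    path.write_bytes(("\n".join(statements) + "\n").encode("utf-8"))


def run_redirected(command: Sequence[str], script: Path, output: Path) -> int:
    """Step b): run the command with stdin and stdout redirected to files."""
    with script.open("rb") as stdin, output.open("wb") as stdout:
        result = subprocess.run(list(command), stdin=stdin, stdout=stdout)
    return result.returncode


def read_utf8_output(path: Path) -> str:
    """Step c): decode the captured output as UTF-8 for viewing."""
    return path.read_bytes().decode("utf-8")
```

A call might look like
run_redirected(["isql", "rooms.fdb"], Path("in.sql"), Path("out.txt")),
where the executable name and the database file are placeholders
for your own setup.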
1. Uppercasing in UTF-8 (UNICODE_FSS)
Only works for 'a'-'z'. This is a known and registered bug.
The only workaround is to cast temporarily to WIN1251 and
uppercase there.
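The symptom and the workaround can be imitated in Python. This
sketch only reproduces the visible behaviour; I am not claiming it
matches the actual cause inside the server. Both function names are
hypothetical. Note the limitation it makes visible: the detour only
works for characters that exist in WIN1251.

```python
def ascii_only_upper(utf8_bytes: bytes) -> bytes:
    """Uppercase only 'a'-'z', mimicking the reported behaviour."""
    # bytes.upper() changes ASCII letters and leaves all other bytes alone.
    return utf8_bytes.upper()


def upper_via_win1251(utf8_bytes: bytes) -> bytes:
    """Workaround analogue: go to WIN1251, uppercase there, come back."""
    # Raises UnicodeEncodeError if a character has no WIN1251 equivalent.
    win1251 = utf8_bytes.decode("utf-8").encode("cp1251")
    uppercased = win1251.decode("cp1251").upper()
    return uppercased.encode("utf-8")


name = "Room комната".encode("utf-8")

print(ascii_only_upper(name) == "ROOM комната".encode("utf-8"))   # True
print(upper_via_win1251(name) == "ROOM КОМНАТА".encode("utf-8"))  # True
```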
2. Other UTF-8 problems.
I cannot say much about the cause of your server dying, other
than that I have also seen it when using UTF-8. First
you must check point 0 above. While it is not nice for the
server to crash on this occasion, you must concede that you
are working in garbage in, garbage out mode.
If you are sure that this problem is addressed, you can try
avoiding the LIKE operator and using BETWEEN instead:
LIKE 'xyz%' <=> BETWEEN 'xyz' AND 'xyz#'
Instead of # you must use a character with a codepoint larger
than all characters you are interested in. For UTF-8 I can
suggest the Euro sign.
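The range trick can be checked in Python, under the assumption that
the strings are compared byte by byte in their UTF-8 form (which
gives the same order as comparing codepoints). The function names
are invented for this sketch. It also shows the caveat: a value
that continues with a character above the Euro sign matches the
prefix but falls outside the range.

```python
UPPER_BOUND_CHAR = "\u20ac"  # Euro sign, used in place of '#'


def between_prefix(value: str, prefix: str) -> bool:
    """value BETWEEN prefix AND prefix + Euro sign, as UTF-8 bytes."""
    low = prefix.encode("utf-8")
    high = (prefix + UPPER_BOUND_CHAR).encode("utf-8")
    return low <= value.encode("utf-8") <= high


def like_prefix(value: str, prefix: str) -> bool:
    """What LIKE 'prefix%' is meant to select."""
    return value.startswith(prefix)


prefix = "ком"
# The last name continues with U+2192, a codepoint above the Euro sign.
names = ["ком", "комната", "комната 12", "кон", "ко", "ком\u21921"]

mismatches = [n for n in names if between_prefix(n, prefix) != like_prefix(n, prefix)]
print(len(mismatches))  # 1: only the name containing U+2192 differs
```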
Regards,
Peter Jacobi