Subject Collation with Numeric-Sort and index
Author Magnus Johansson (nimajo)

Hi all,

 

Sorry for being late to the party, finally got some time get my hands on moving from FB2.5 to FB3.

 

I'm looking at creating collations in order to get WHERE and ORDER BY working as my needs, preferably also handling of numeric stored as text sorted in numeric order.

Still haven't decided yet if I should go for ISO8859_1 or for UTF8 though, weighing pros and cons regarding features and impact of bytes.

 

Living in Sweden I think most of you know that I have to deal with åäö.

 

However, I have found some quirks that currently prevent my progress.

 

First I have downloaded the snapshot for FB 3.0.3 for Windows (Firebird-3.0.3.32802-0_x64).

In order to get UNICODE collation with Swedish to work I went to http://site.icu-project.org/download and downloaded icu4c-52_1-Win64-msvc10.zip from where I got the icu*52.dll's and extracted them to FB3. (After I first had deleted all icu*.* files in FB3).

No changes done in intl\fbintl.conf or anything else, is it really that simple?

 

Then I created a database, created some collations and a table with some data:

 

SET NAMES UTF8;

SET SQL DIALECT 3;

 

CREATE DATABASE 'C:\Data\Test\Collation_CI.fdb'

USER 'SYSDBA' PASSWORD 'masterkey'

PAGE_SIZE 16384

DEFAULT CHARACTER SET UTF8;

 

SHOW VERSION;

ISQL Version: WI-V3.0.3.32802 Firebird 3.0

Server version:

Firebird/Windows/AMD/Intel/x64 (access method), version "WI-V3.0.3.32802 Firebird 3.0"

on disk structure version 12.0

 

CREATE COLLATION UTF_SV_CI

       FOR UTF8

       FROM UNICODE

       CASE INSENSITIVE

      'LOCALE=sv_SE';

 

CREATE COLLATION UTF_SV_CI_NUM

       FOR UTF8

       FROM UNICODE

       CASE INSENSITIVE

      'LOCALE=sv_SE;NUMERIC-SORT=1';

 

CREATE COLLATION ISO_SV_CI

       FOR ISO8859_1

       FROM SV_SV

       CASE INSENSITIVE;

 

COMMIT;

 

CREATE TABLE TABLE_T (

    FIELD_UTF_SV_CI     VARCHAR(5) CHARACTER SET UTF8      COLLATE UTF_SV_CI,

    FIELD_UTF_SV_CI_NUM VARCHAR(5) CHARACTER SET UTF8      COLLATE UTF_SV_CI_NUM,

    FIELD_ISO_SV_CI     VARCHAR(5) CHARACTER SET ISO8859_1 COLLATE ISO_SV_CI

);

 

COMMIT;

 

INSERT INTO TABLE_T VALUES ('1',  '1',  '1' );

INSERT INTO TABLE_T VALUES ('2',  '2',  '2' );

INSERT INTO TABLE_T VALUES ('10', '10', '10');

INSERT INTO TABLE_T VALUES ('20', '20', '20');

INSERT INTO TABLE_T VALUES ('a',  'a',  'a' );

INSERT INTO TABLE_T VALUES ('AA', 'AA', 'AA');

INSERT INTO TABLE_T VALUES ('aa', 'aa', 'aa');

INSERT INTO TABLE_T VALUES ('Aa', 'Aa', 'Aa');

INSERT INTO TABLE_T VALUES ('A',  'A',  'A' );

INSERT INTO TABLE_T VALUES ('b',  'b',  'b' );

INSERT INTO TABLE_T VALUES ('B',  'B',  'B' );

INSERT INTO TABLE_T VALUES ('o',  'o',  'o' );

INSERT INTO TABLE_T VALUES ('O',  'O',  'O' );

INSERT INTO TABLE_T VALUES ('x',  'x',  'x' );

INSERT INTO TABLE_T VALUES ('X',  'X',  'X' );

INSERT INTO TABLE_T VALUES ('å',  'å',  'å' );

INSERT INTO TABLE_T VALUES ('Å',  'Å',  'Å' );

INSERT INTO TABLE_T VALUES ('ä',  'ä',  'ä' );

INSERT INTO TABLE_T VALUES ('Ä',  'Ä',  'Ä' );

INSERT INTO TABLE_T VALUES ('ö',  'ö',  'ö' );

INSERT INTO TABLE_T VALUES ('Ö',  'Ö',  'Ö' );

 

COMMIT;

 

 

Now some tests.

 

 

SELECT FIELD_UTF_SV_CI

FROM   TABLE_T

WHERE  FIELD_UTF_SV_CI BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_UTF_SV_CI

===============

1              

10             

2              

20             

a              

A              

aa             

Aa             

AA             

b              

B              

===================

Result as expected

 

 

SELECT FIELD_UTF_SV_CI_NUM

FROM   TABLE_T

WHERE  FIELD_UTF_SV_CI_NUM BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_UTF_SV_CI_NUM

===================

1                  

2                  

10                 

20                 

a                   

A                  

aa                 

Aa                 

AA                 

b                  

B                  

===================

Result as expected

 

 

SELECT FIELD_ISO_SV_CI

FROM   TABLE_T

WHERE  FIELD_ISO_SV_CI BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_ISO_SV_CI

===============

1              

10             

2              

20             

A              

a              

Ä             

Å             

Ö             

ä             

å             

ö             

AA             

Aa             

aa             

B              

b              

===================

Not the expected result.

Both wrong order and records that should not be there.

 

 

Let’s add some index:

 

COMMIT;

 

CREATE INDEX IDX_UTF_SV_CI     ON TABLE_T (FIELD_UTF_SV_CI);

CREATE INDEX IDX_UTF_SV_CI_NUM ON TABLE_T (FIELD_UTF_SV_CI_NUM);

CREATE INDEX IDX_ISO_SV_CI     ON TABLE_T (FIELD_ISO_SV_CI);

 

COMMIT;

 

 

And repeat the queries from before:

 

 

SELECT FIELD_UTF_SV_CI

FROM   TABLE_T

WHERE  FIELD_UTF_SV_CI BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_UTF_SV_CI

===============

1              

10             

2              

20             

a              

A              

aa             

Aa             

AA             

b              

B              

===================

Result as expected

 

 

SELECT FIELD_UTF_SV_CI_NUM

FROM   TABLE_T

WHERE  FIELD_UTF_SV_CI_NUM BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_UTF_SV_CI_NUM

===================

1                   

2                  

10                 

20                 

a                  

A                  

aa                 

Aa                 

AA                 

b                  

===================

Not the expected result.

Where did the 'B' go?

 

 

SELECT FIELD_ISO_SV_CI

FROM   TABLE_T

WHERE  FIELD_ISO_SV_CI BETWEEN '' AND 'b'

ORDER BY 1;

 

FIELD_ISO_SV_CI

===============

1              

10             

2              

20             

a              

A              

å             

Å             

ä             

Ä             

ö             

Ö             

AA             

aa             

Aa             

b              

B              

===================

Not the expected result.

Both wrong order and records that should not be there.

Although the order differs slightly from the same query without index.

 

 

I have done the same tests with the release version of FB 3.0.2 (Firebird-3.0.2.32703-0_x64) as well as tested with icu 53 and 57 for them both with the same result, no differences there.

 

Using the original icu files that comes with Firebird-3.0.2.32703-0_x64 and Firebird-3.0.3.32802-0_x64 and doing the same tests for FIELD_ISO_SV_CI gives the same result, so no difference there if the icu library is replaced or not.

 

I have also done some test with ACCENT INSENSITIVE but I will leave that beast for now.

 

Can anyone shed some light on this or point me to what I may be missing?

 

Thanks for reading,

Magnus Johansson