Subject Re: [Firebird-Architect] Re: UTF-8 (various)
Author Lester Caine
Daniel Rail wrote:
> Hello Lester,
>
> Friday, March 4, 2005, 11:49:24 AM, you wrote:
>
>
>
>>Jim Starkey wrote:
>
>
>>>>Jim, don't forget the field's default collation, before the session's.
>>>>
>>>
>>>I didn't forget it -- I left it out on purpose. In my mind, collation
>>>is a property of data manipulation operation, not the data itself. It
>>>may be useful as a default for indexes containing that field, but I
>>>don't think it should override the default collation specified by a
>>>client. I'm willing to concede that a declaration of case insensitivity
>>>expressed as a collating sequence doesn't work well with this model.
>>>I'm not happy with that, and I'm going to have to think about it some more.
>
>
>>The one thing that people seem to forget about is that the DATA does not
>>have any order at all, so as long as an index is available with a simple
>>primary key then any record can be found. It's only when you want to
>>order the data that you need the collation.
>
> Not true. Try case-insensitive and accent-insensitive comparison. Or
> even case-insensitive comparison.
>
> If I do a search, i.e.: LASTNAME STARTING WITH 'em', I might want all
> the rows where LASTNAME starts with 'E', 'e', and any other accented
> 'e' or 'E', then followed by 'M' or 'm'. If the table has 10 million
> rows and there's only 100 rows that would correspond to that
> criteria, I think an index with the proper collation would be
> welcomed, so that the data is fetched as fast as possible.

You are missing my point. Yes you want a particular index - with a
particular 'insensitivity', and that index can be created and managed
persistently, or built on the fly as required. But it is calculated on
the data and only needs the primary key for the data to find the actual
records. Its how these indexes are managed that is the problem not the
raw data, and it may be that we need a general index to find all the
relevant 'E' records and filter those on the next character(s) if a
correct 'hard' index is not available. Looking at the Unicode data, the
number of variations is not too heavy, but it is still probably too
large to be all possible collations in the one index, so some compromise
may be needed.

--
Lester Caine
-----------------------------
L.S.Caine Electronic Services