firebird-architect - Case Insensitivity (revisited)

Subject	Case Insensitivity (revisited)
Author	Jim Starkey
Post date	2000-04-03T18:16:50Z

I'd like to review the discussion from last week and try
to sort out the opinions that flew by.

The discussion started with a desire for case insensitive
sorting. Stated as a requirement, it was:

1. The ability to order a record set in a case insensitive
manner.

This was immediately amended to include searching, then efficient
(i.e. indexed searching). As requirements:

2. The ability to select records (apply a predicate) in a
case insensitive manner.

3. A mechanism to optimize #2 with an index providing an
appropriate index exists.

4. A mechanism to create an "apppropriate index".

Additional requirements that should have been stated are:

5. The ability to order by and search for (apply a predicate to)
field in a sensitive manner cannot preclude the ability to
to order by or search for a field in a case sensitive
manner.

In other words, the base requirements are abilities to search
and order in both case sensitive and case insensitive manners
and the ability to create indexes to optimize the searches.

The following alternatives were presented to satisfy the
requirements:

A. Use the existing national language collating mechanism
to define a case-insensitive collating sequence.

B. Define a case-insensitive character data type.

C. Define a case-insensitive field type and a set of
expression type rules to determine collation and
equality rules.

D. Define an extension to ORDER to specify case insensitive
ordering and a set of case insensitive operators to
perform case insensitive comparisons.

E. Using a combination of views and the UPCASE function
to do case insensitive searching and ordering.

Among the complications are a non-implemented section of SQL/92
describing a variation on alternative C and the apparent
existence of multiple case insensitive collating sequences
in some languages.

Alternative A suffers from an intrinsic inability to mix case
sensitivity and insensitivity in the same statement. It would
be very useful to be able to order a record set primarily by
a specific field case insensitive and secondarily by the same
field case sensitive. Alternative A also runs into nasty
problems with stored procedures. Also, in my opinion, language
semantics that depend on transcient modes external to the syntax
is bad language design. If you can't look at a language block
and determine what it does, the language design is at fault.

Alternative B breaks the transitivity of the equality operator.
If FIELD1 is a case insensitive field and FIELD2 is a case
sensitive field:

"ABC" = FIELD1 ("ABC") is true
FIELD1("ABC") = FIELD2("abc") is true
"ABC" = FIELD2 ("abc") is false

If the rules are changed for that mixed mode expressions are
case sensitive then case insensitive retrievals based on
string constants fail.

Alternative C is very peculiar. This particular piece of SQL/92
exempted, I know of no language where a syntactic operator
semantics are overridden by a sub-expression operator. All other
computer languages that overload syntactic operators do so based
on the type of sub-expressions. There is a very good reason
for this: In general, a compiler doesn't always know the
sub-expression operator, as in a variable or non-builtin function
call. It is unfortunate that the SQL committee would standardize
such a poorly thought out idea. The problems with UDFs are,
I believe, insolvable. I suppose a creative person could
try to salvage this porker by adding case insensitively tags
to UDF declarations, variable, parameters, as well as overrides
for max, min, the two SQL case expressions, etc., but it would
be wrong. Expressions should return typed values, not type
values plus ephemera.

Alternative D is clearly sufficient, but inconsistent with SQL
standard. In a previous existence my policy was that Interbase
implemented SQL; it didn't fix it or extend it. But those were
the good old days when we did our innovating is GDML while waiting
the SQL to catch up. But GDML is gone, and the alternatives
are SQL extensions or stagnation. We shouldn't lightly declare
a particularly SQL committee feature brain dead and go our
separate way. But if we must, we must.

Alternative E is doable. In fact, the only extension required
is expression indexes, which are good thing in their own right.
However, I find the alternative unsatisfying. Requiring a view
definition containing two instances each searchable character
field (one to carry mixed case, the other upper case) is a crock.
If it were acceptable, we wouldn't be having this discussion now.

Jim Starkey