firebird-support - Re: Followup newbie question about case of identifiers

Subject	Re: Followup newbie question about case of identifiers
Author	John Craig
Post date	2005-02-01T06:23:34Z

--- In firebird-support@yahoogroups.com, "Ann W. Harrison"
<aharrison@i...> wrote:

> John Craig wrote:
>
> > I propose a modification that'll make a lot of sense for those of

> > migrating from other DB's which don't make assumptions about

unquoted

> > identifiers in SQL statements: could we have a mode in a future
> > version that skipped forcing identifiers parsed from SQL to upper

case

> > when they're not quoted?
>
> Probably not... It's a SQL standard thing - identifiers are
> case-insensitive and fold to upper case.

Ann,

I don't read the standard quite that way, but I can see that this
interpretation would be widely accepted and perhaps was intended by
the standard's authors. The standard I have easy access to is: ANSI
INCTIS 135-1992 (R1998) which was apparently formerly known as: ANSI
X3.135-1992 (R1998)--this doesn't seem to be the so-called SQL3
standard, but the behavior in question is goes back to this standard
at least, I'm sure. Here's the language from Section 5.2 <token> and
<separator> Syntax Rules 10 (page 69 of the standard; 87th page of
PDF):

"The <identifier body> of a <regular identifier> is equivalent to an
<identifier body> in which every letter that is a lower-case letter is
replaced by the equivalent upper-case letter or letters.
This treatment includes determination of equivalence, representation
in the Information and Definition Schemas, representation in the
diagnostics area, and similar uses."

Now, I suppose you COULD interpret this as meaning that you would
actually replace the lower-case letters with their upper-case
equivalents for determining equivalence of identifiers and storing
them in system tables. But the language doesn't quite say that with
complete clarity. What it says is that upper and lower-case letters
cannot be treated as distinctive in identifiers. (For anyone else who
had to look it up: a <regular identifier> is one that doesn't have
double quotes around it. So, if I do not quote an identifier,
conforming SQL is supposed to treat upper and lower-case variant of a
letter as identical.)

So, as far as that goes, FB behaves according to the standard in that
if you specify the name of a table in lower-case, it gets treated as
upper-case for the purposes of equivalence and storage in the system
tables. (The prejudice of the standard's authors is pretty clearly in
line with your statement, but it's also self-contradictory in at
least one place.)

According to the standard, if I put something in double quotes, then I
get case-sensitivity when comparing to a non-quoted identifier
according to rule 13 from the same section (page 70; PDF page 88)
[John: a delimited identifier is surrounded by double-quotes and can
have blanks in it]:

"A <regular identifier> and a <delimited identifier> are equivalent
if the <identifier body> of the <regular identifier> (with every
letter that is a lower-case letter replaced by the equivalent
upper-case letter or letters) and the <delimited identifier body> of
the <delimited identifier> with all occurrences of <quote> replaced by
<quote symbol> and all occurrences of <double-quote symbol> replaced
by <double quote>), considered as the repetition of a <character
string literal> that specifies a <character set specification> of
SQL-TEXT and an implementation-defined collation that is sensitive to
case [John: note the difference here--sensitive to case], compare
equally according to the comparison rules in Subclause 8.2,
"<comparison predicate>"."

Case-insensitive? Yes--unless it's quoted. Fold to upper case? I guess
I can believe that was the intent. What it actually says is that a
lower-case or mixed-case representation is equivalent to the
corresponding all upper-case representation unless you've quoted the
identifier. But it also says "This treatment" [meaning replacement of
lower-case letters with upper-case] applies to storage in system
tables--but is "This treatment" just for the purpose of evaluating
equivalence? It seems to me the standard shouldn't bother to care
about the actual storage, but it does seem the authors may well have
intended it to.

So, this should be perfectly valid:

create table a_mixed_Case_Identifier
( one_column integer,
, "TWO_column" varchar(30)
)
...
select ONE_COLUMN, "TWO_column"
from a_MIXED_case_identifier

And the way FB behaves would seem to be consistent with this.

According to a number of spots in the standard, the equivalence of
identifiers is defined in Subclause 8.2, "<comparison
predicate>"--unfortunately, 8.2's content is rather vague. The only
real help you get is in the Syntax Rules that dictate
case-insensitivity for regular identifiers and sensitivity for
delimited identifiers.

But, in the introduction to Section 21 Information Schema and
Definition Schema, there's this statement (which certainly includes
the notion of the upper-case folding, but seems to me to be oddly
self-contradictory):

"<identifier>s are represented in the tables of the Information Schema
by <string value expression>s corresponding to their <identifier
body>s (in the case of <regular identifier>s) or their <delimited
identifier body>s (in the case of <delimited identifier>s). Comparison
of <identifier>s is defined in Subclause 8.2, "<comparison
predicate>". Where an <identifier> has many equivalent forms, the
one encountered at definition time is stored (of course, any lower
case letters appearing in a <regular identifier> will have been
converted to the corresponding upper case letter before the
<identifier> is stored in any table of the Information Schema)."

This is an odd statement. It does seem to reflect a prejudice on the
part of at least some of the standard's authors which isn't justified
by the standard's actual EBNF definitions and syntax rules. Moreover,
the statement itself seems internally inconsistent: is the "equivalent
form...encountered at definition time" to be stored or is it the
upper-case equivalent form that is stored? That is to say, the part of
the sentence that begins "of course" is hardly an elaboration; it
contradicts what goes before--unless some other kind of equivalence is
intended and case is supposed to be excluded.

Also, consider that the above statement follows this opening
paragraph:

"The views of the Information Schema are defined in terms of base
tables. The only purpose of the Definition Schema is to provide a data
model to support the Information Schema and to assist understanding.
An implementation need do no more than simulate the existence of the
Definition Schema, as viewed through the Information Schema views."

In short, the Information Schema is defined as a series of SQL
views--so you wouldn't "store" anything in them. So, this section of
the standard is confused. However, it does seem that the authors of
the standard were in the habit of using all upper-case letters for
identifiers (the examples are all upper-case--and very difficult to
read as a result).

So, I concede that it is possible to read the standard as requiring
the switch to upper-case for storage in the system tables--although I
think that is an assumption that's not quite warranted by the
standard's language with the exception of that one comment in the
Introduction to section 21. Personally, I agree with treating them as
equivalent and with the part about storing the variant that is first
encountered. Forcing everything to upper-case is giving up way too
much in readability. So, how do I get myself onto the standards
committee next time around? ;-\

Thanks for your time!

John