Subject | A Fresh Look at Collations |
---|---|
Author | Jim Starkey |
Post date | 2010-06-18T18:11:27Z |
The time has come to take a serious look at collations for NimbusDB, which
has prompted me to take a fresh look at the issues and come up with
ideas possibly worth bouncing around. I've spent a lot of time and
effort ignoring international issues, a fruitful endeavor but, alas, a
dead end. In the interest of cross pollination and exploiting folks
more experienced in this area than I am, here is the approach I'm taking
in NimbusDB:
1. The database engine itself is strictly utf8 only. Character set
conversions are purely client side.
2. Collations are loadable rather than hard coded, represented in XML.
3. An external collation builder is driven primarily from the Unicode
DUCET (Default Unicode Collation Element Table), with provision
for additional runtime rules, character subsets, etc. The
collation builder does an analysis of the DUCET weights for a given
character subset to minimize byte code explosion. A sample rule,
for example, would be that L2 weights be applied backwards to
accommodate the French.
4. In addition to characters and weights, a collation contains
separate rules for comparison and collation (I think this is a
significant departure from accepted practice). The alternatives
are Exact (codepoints match exactly), Upcase (codepoints map to
the same upper case character as defined by the Unicode standard),
L1 (base character match), L2 (base character + accent match), L3
(base character + accent + case match), and L4 (all that plus tie
breakers).
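To make #3 and #4 concrete, here is a minimal Python sketch of level-based keys with a separate comparison rule. It is an illustration under loud assumptions, not NimbusDB code: the weights are toy values derived from Unicode decomposition rather than real DUCET weights, and the names char_weights, sort_key, and matches are hypothetical.

```python
import unicodedata

def char_weights(ch):
    """Toy (L1, L2, L3) weights for one character. NOT real DUCET values."""
    decomposed = unicodedata.normalize("NFD", ch)
    base, marks = decomposed[0], decomposed[1:]
    l1 = ord(base.lower())              # L1: base character
    l2 = tuple(ord(m) for m in marks)   # L2: accents (combining marks)
    l3 = 1 if base.isupper() else 0     # L3: case
    return l1, l2, l3

def sort_key(s, level, backwards_l2=False):
    """Build a key holding all L1 weights, then all L2 weights, and so on."""
    s = unicodedata.normalize("NFC", s)
    weights = [char_weights(ch) for ch in s]
    l1 = tuple(w[0] for w in weights)
    l2 = tuple(w[1] for w in weights)
    l3 = tuple(w[2] for w in weights)
    if backwards_l2:
        l2 = l2[::-1]                   # the "French" rule: accents compared from the end
    key = (l1,)
    if level >= 2:
        key += (l2,)
    if level >= 3:
        key += (l3,)
    if level >= 4:
        key += (tuple(map(ord, s)),)    # L4: codepoints as a tie breaker (an assumption)
    return key

LEVELS = {"L1": 1, "L2": 2, "L3": 3, "L4": 4}

def matches(a, b, rule):
    """Comparison rule, chosen independently of the rule used for sorting."""
    if rule == "Exact":
        return a == b
    if rule == "Upcase":
        return a.upper() == b.upper()
    return sort_key(a, LEVELS[rule]) == sort_key(b, LEVELS[rule])

words = ["côté", "cote", "coté", "côte"]
forward = sorted(words, key=lambda w: sort_key(w, 3))
french = sorted(words, key=lambda w: sort_key(w, 3, backwards_l2=True))
print(forward)  # ['cote', 'coté', 'côte', 'côté']
print(french)   # ['cote', 'côte', 'coté', 'côté']
print(matches("Cote", "cote", "Upcase"), matches("Cote", "cote", "L3"))  # True False
```

The point of the sketch is only that one set of weights can back two different rules: "Cote" and "cote" compare equal under Upcase yet still sort as distinct values under L3.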
The rationale for #4 is that while the rules for comparison are likely to be
application specific, people will still want sorts to collate
correctly. Most collations will probably use Upcase for comparisons
and L3 for collations. I'm still sitting on the fence whether provision
should be made for indexes. If indexes are to be used to optimize
"select ... where ... order ... limit ...", the index rule should match
the collation, otherwise it should match the comparison. I think.
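The index question can be sketched the same way. The following is a rough Python illustration, assuming a sorted list as a stand-in for a real index and two hypothetical key functions (comparison_key for an Upcase-style rule, collation_key for an L3-style rule with toy weights). It shows why an index keyed by the comparison rule cannot serve "order ... limit ...", while one keyed by the collation rule can.

```python
import bisect
import unicodedata

def comparison_key(s):
    # Stand-in for the Upcase comparison rule.
    return s.upper()

def collation_key(s):
    # Stand-in for an L3-style collation: base letters, then accents, then case.
    parts = [unicodedata.normalize("NFD", ch)
             for ch in unicodedata.normalize("NFC", s)]
    base = [p[0].lower() for p in parts]
    accents = [p[1:] for p in parts]
    case = [p[0].isupper() for p in parts]
    return (base, accents, case)

def build_index(rows, key):
    # A sorted list plays the role of an ordered index here.
    return sorted(rows, key=key)

def order_by_limit(index, n):
    # If the index is already in collation order, the first n entries suffice.
    return index[:n]

def lookup(index, key, value):
    # Equality lookup: every entry whose key equals the key of the probe value.
    keys = [key(r) for r in index]
    k = key(value)
    return index[bisect.bisect_left(keys, k):bisect.bisect_right(keys, k)]

rows = ["Zebra", "apple", "Émile", "emu", "Apple"]
ordered_index = build_index(rows, collation_key)
lookup_index = build_index(rows, comparison_key)

print(order_by_limit(ordered_index, 3))               # ['apple', 'Apple', 'Émile']
print(order_by_limit(lookup_index, 3))                # ['apple', 'Apple', 'emu']
print(lookup(lookup_index, comparison_key, "APPLE"))  # ['apple', 'Apple']
```

In this toy, the comparison-keyed index files "Émile" after "Zebra", so reading its first three entries gives the wrong answer for an ordered query, even though it groups "apple" and "Apple" together for equality lookups.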
Any thoughts, arguments, or wisdom?
--
Jim Starkey
Founder, NimbusDB, Inc.
978 526-1376