Subject Re: [IB-Architect] Identifier naming woes
Author Jim Starkey
At 10:18 PM 5/24/01 -0400, David Jencks wrote:
>>
>> There is a simple trick that obviates the need for identifiers. The
>> basic idea is that here is a single (internal) registry for "tokens"
>> (cannonical strings). The registry is a simple hash table of known
>> strings. An arbitrary string is passed into the registry. If it
>> matches a known string, the address of the string is returned, otherwise
>> the new string is copied into the registry and its address returned.
>>
>> All strings associated with database objects (from the parser, meta-
>> data handlers, etc) are sanitized through the string registry. Two
>> "tokens", therefore, can be correctly compared for equality by
>> comparing their addresses. A hash table of object names can be
>> maintained by taking the address of the tokens mod table size.
>>
>
>
>At first I hated this idea, and when trying to explain why I realized how
>good it is. My remaining question is, what happens when the hash table
>needs to expand (e.g. using extensible linear hashing)? Not that my
>knowledge of hashing is stellar, but I don't know of any scheme that
>expands the hash size without moving some items. Wouldn't this require
>updating all the moved token values? I suppose you could keep references
>to each location the token is used.... but this is starting to sound
>complicated.
>

The smart thing to do is allocate a large buffer(s) to hold the
string, so the strings wouldn't need to be moved (which, of course,
would be impossible anyway). This is important for other reasons,
as well. Memory allocations are surprisingly expensive. Not
only is there the internal structure of the heap, but also the
need to interlock the heap against thread switches. Managing
the string memory as part of the token registry is simple and
virtually free. (The very clever hacker will note that a
hash table can be safely "read" without a lock, which is a
big plus.)

I've never been a fan of fancy hashing schemes. There are two
cases. First, the access cost to a node is cheap (e.g. in memory),
so who cares? Second, the access cost is high (e.g. on disk),
in which case hash always loses (this is a fun argument).

>Also, this might make the system tables kind of unreadable, unless you also
>tokenized all the string values in data. I did read about some project in
>france in the early 80's that tried this, but I don't know how it worked
>out.
>

No, this is only an in-memory symbol lookup system. No change in the
ODS, a net reduction in the lines of code.

>Would this be a major revamping of most of firebird's internals or just a
>minor change?
>

Mostly removal of ughly code. It could be done in conjunction
with the addition of multi-level namespaces.

Jim Starkey