Subject: Re: [Firebird-Architect] Re: Well, here we go again
Author: Jim Starkey
unordained wrote:
> It could be helpful for the discussion to show examples of such situations -- what problems are
> hard to solve with the relational model, what data hard to organize?
>
> Jim mentioned inheritance earlier, to which there are several known "solutions":
> - one table with fields from all subtypes (messy, even messier with multiple-inheritance;)
> - one table per concrete class (messy for selects against the base type -- this is what Postgres
> did, hard to enforce constraints as each concrete class has its own PK definition;)
> - one table per extension (requires lots of updatable views to be defined, very carefully, and
> there's an issue of row identity, particularly for non-exclusive multiple-inheritance, where the
> extended table can't share a PK with its parent, because it has several parents and you'd rather
> not have to extend all base classes from a single universal base class the way Java does with
> Object.)
>
Sorry if this is a repeat, but it's worth taking up in some detail.

The semantic extension allows simple handling of classes of things that
have some common attributes and other attributes that aren't so common.
The example I gave was a clothing merchant. All of his articles have an
SKU, a description, a unit price, type of material, etc. But shirts
have collar size and sleeve length, pants have waist and inseam, shoes
have size and width, etc. Sometimes he wants to handle the common
attributes of all articles and sometimes he wants to handle just pants or
just shirts, and wants to see the attributes specific to those records.

Postgres tried to handle this with many tables and views to
aggregate the records. I don't think that works well. A simpler
implementation has a single base table. This requires that the lower
layers of the data store understand the type hierarchy, but this isn't a
serious problem.

I don't see any purpose to multiple inheritance in the semantic world.
I'd go further and say that multiple inheritance in almost any domain is
one of mother nature's gentle ways of telling you that your application
design is lousy. C++ has it, Java doesn't (uses interfaces instead),
and Java is the better language for it. There are lots of other pieces
of OO baggage that people want to hang off the semantic model that
should be resisted.

So: single parent, primary key defined in the base table, other
indexes defined in the type hierarchy as needed, and no sex change
operations (a record never changes its type). Simple, elegant, and to
the point.
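That single-base-table scheme can be sketched with SQLite. The schema below is illustrative, not from the original: common attributes live in one table with the primary key on the base, subtype-specific columns are nullable, and a type tag says which subtype each row is.

```python
import sqlite3

# Hypothetical clothing-merchant schema: one base table, PK on the base,
# subtype columns nullable, a 'kind' tag identifying the subtype.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE article (
        sku         TEXT PRIMARY KEY,   -- primary key defined on the base
        description TEXT NOT NULL,
        unit_price  REAL NOT NULL,
        material    TEXT,
        kind        TEXT NOT NULL,      -- 'shirt', 'pants', 'shoes', ...
        collar_size REAL,               -- shirt-only
        sleeve_len  REAL,               -- shirt-only
        waist       REAL,               -- pants-only
        inseam      REAL                -- pants-only
    )
""")
con.executemany(
    "INSERT INTO article (sku, description, unit_price, material, kind,"
    " collar_size, sleeve_len, waist, inseam) VALUES (?,?,?,?,?,?,?,?,?)",
    [("S1", "oxford shirt", 40.0, "cotton", "shirt", 15.5, 34.0, None, None),
     ("P1", "chinos",       55.0, "cotton", "pants", None, None, 32.0, 30.0)])

# Query the common attributes of all articles...
all_skus = [r[0] for r in con.execute("SELECT sku FROM article ORDER BY sku")]

# ...or just the pants, with their subtype-specific attributes.
pants = con.execute(
    "SELECT sku, waist, inseam FROM article WHERE kind = 'pants'").fetchone()
```

A real implementation would have the storage layer understand the type tag directly rather than exposing nullable columns, but the shape of the data is the same.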
> The point is that if we're to judge models based on how well they support our needs in the real
> world, rather than just academically (I believe the relational model pretty well won that battle
> already), we'll need examples.
>
That's the only test that matters.

Type hierarchies exist everywhere in nature. If there weren't type
hierarchies, we wouldn't need adjectives.
>
>
> The type system we're stuck with currently -- and this is true in the database world and the
> procedural world -- is the result of a physical issue, not a logical issue. We simply don't have a
> good way to quickly answer the question "is this value usable as an X", or put another way, "does
> this value allow me to do Y with it?" The same issue comes up with files, right? Why do we use file
> extensions? Because running a query against each file, every time, to determine if it's a valid PDF
> document (or could be) would be expensive -- so we precalculate the type. But as far as the PDF
> viewer is concerned, all it cares about is "the file passed must be usable as a PDF" -- meaning it
> has to have a specific binary layout. Why do we have integers and characters as datatypes? Because
> we need a quick way to determine whether or not the binary data is usable for certain operations,
> such as sums. We don't want to have to run a query against the whole table only to discover that
> one value wasn't usable in a sum.
>
> There's no rule that says that the only typing systems possible are based on this premise. We
> should be able to define logical constraints in a database that say "check(value is integer or
> (value is text and value like '%ml'))". Why can't we? Why not be able to store both integers and
> text values in the same field, *as is*?
>
Actually there is such a rule: Comparison semantics. Numbers are
compared numerically, strings are compared character wise according to
collation rules, and dates are compared after adjustment for base date.
There are also rules for comparing dissimilar types. When comparing a
string to a number, the string is converted to a number and an error
thrown if this isn't possible. Comparing numbers and dates, on the
other hand, is just an error (unless you want to define certain
unnatural acts...).

Even so, in most systems, you can fetch any abstract value in any
specific type and get either a sanctioned conversion or an error.
> Because we haven't solved the efficiency problem of determining, on the fly, how a value can be
> used -- whether or not it meets arbitrary constraints. (If you passed that field to a procedure
> whose input parameter had a constraint of "check(value is integer and value > 35)", it would have
> to check each one, fail on the text rows, check the value, etc.) A relational database is supposed
> to be a logical construct -- but we're still shackled by these physical constraints.
>
We're constrained by logic, not implementation. Nothing will let you
compare "next Tuesday" and "14" and get a rational answer. Efficiency
is a different issue. Personally, I like the relational model because
it gives an implementor a lot of options to execute a high level request
in an efficient manner.
> Our solutions have been to tag values as being of a specific type, and as an improvement, of
> tagging some specific types as being subclasses of others -- but we still bind values to specific
> types, and that screws us over functionally. How do we define "some A's are X's and some B's are
> X's"? What is X? It's a constraint -- a boolean function that takes a value and tells you if that
> value is in its set, but we're not sure how to fit that into an overall typing system. We keep
> making it a filter, a second stage after filtering by specific datatype. First you check that it's
> an A, then you see if it's also an X. If you need a procedure to work with X's of either A or B
> type, well, you create two functions, do some funky routing, maybe add a superclass over A and B so
> you can do some magic, ... it's a mess.
>
I disagree. It's not a mess. A function that takes a date as an
argument should be expected to work if passed a number. It may work if
passed a string and the string can be converted to a valid date. But
again, these are properties of logic itself, not data.
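The argument-conversion rule above can be sketched like this; the base date and function names are assumptions for illustration, not from the original. A function declared to take a date also works when passed a number (days past a base date) or a string that parses as a date, and errors otherwise.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)   # assumed base date, illustrative only

# A date-taking function should work if passed a number, and may work
# if passed a string that converts to a valid date.
def as_date(value):
    if isinstance(value, date):
        return value
    if isinstance(value, (int, float)):
        return EPOCH + timedelta(days=value)
    if isinstance(value, str):
        return date.fromisoformat(value)   # raises if not a valid date
    raise TypeError(f"cannot convert {value!r} to a date")

def day_of_week(value):
    return as_date(value).strftime("%A")
```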
> The classic example is this: you have a type of "ellipse", with a binary definition that contains
> its axes; you have a procedure that says it can only accept circles -- how do the two interact?
> Logically, if the axes of the ellipse happen to be equal, it *is* a circle, yes? It's not a circle
> in the binary data-representation sense (the two types would probably be defined very differently
> in terms of bytes), but logically, that value is a circle! Maybe it would be acceptable for the
> procedure to require that it be given either a circle (in terms of exact datatype) or an ellipse
> (again, specific datatype) so long as the ellipse happens to be a circle -- something you can test
> on an ellipse without casting.
>
These aren't database issues. They are issues, but exist in almost all
languages. Databases don't make them better or worse.
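The ellipse/circle interaction quoted above can be sketched as a predicate check on the value rather than a cast to a different static type (class and function names are illustrative):

```python
from dataclasses import dataclass
from math import pi

@dataclass(frozen=True)
class Ellipse:
    semi_major: float
    semi_minor: float

    def is_circle(self, tol: float = 1e-9) -> bool:
        # "Circle" as a predicate on the value, not a storage type.
        return abs(self.semi_major - self.semi_minor) <= tol

def circumference_of_circle(e: Ellipse) -> float:
    # Accepts any Ellipse value, but only if it happens to be a circle.
    if not e.is_circle():
        raise TypeError("expected a circle")
    return 2 * pi * e.semi_major
```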
> This isn't a relational problem, it's much larger than that. If you do solve the type issue, make
> sure you solve it for procedural languages too, as they're not going away, and people will expect
> the database and procedural languages to have equivalent functionality (which is why they gripe
> about OO today.)
>
Oh, I guess we agree, then.
> Note that postgres already provided custom datatypes -- last I checked, the only famous ones are
> GIS-related. You don't find a proliferation of new custom datatypes available for download; people
> seem, on the whole, to stick to the basics. That might be the result of interoperability issues, I
> don't know. Maybe everyone is afraid to use RDBMS-specific features -- at which point this whole
> enterprise is lost. (Inertia, as you said.) Maybe it's that any custom datatypes would then need
> custom mappings to other custom datatypes, making the whole world of B2B that much more
> complicated. (Every business defining custom datatypes internally, without sharing them as part of
> any international standards?) Maybe it's the mismatch between database and procedural systems.
> Maybe it's just too complex -- remember that most people have trouble even coming up with a
> relatively good database design, without worrying about custom types.
>

Some very good arguments can be made for complex types, for example, a
blob containing a known object cluster serialization with methods to
extract specific attributes. The example I was given is an airline
ticket object containing all legs of a flight, a history of changes, a
history of payments, etc. Normalizing it would be horrible. Sometimes
it can be handled in database context with user-defined methods by
deserializing the object cluster and invoking an object method, and
sometimes the gook gets shipped back to an application program that
knows a great deal more.
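The "blob containing a known serialization" idea can be sketched with SQLite and JSON; the ticket structure and function names here are hypothetical. The whole object cluster is stored as one blob, and a user-defined method deserializes it inside the database to extract one attribute rather than normalizing everything into tables.

```python
import json
import sqlite3

# Hypothetical airline-ticket object cluster: legs, payments, etc.
ticket = {
    "record_locator": "ABC123",
    "legs": [{"from": "BOS", "to": "SFO"}, {"from": "SFO", "to": "BOS"}],
    "payments": [{"amount": 450.00, "method": "card"}],
}

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ticket (locator TEXT PRIMARY KEY, body BLOB)")
con.execute("INSERT INTO ticket VALUES (?, ?)",
            (ticket["record_locator"], json.dumps(ticket)))

# User-defined method registered with the database: deserialize the
# cluster and extract a specific attribute.
def leg_count(body):
    return len(json.loads(body)["legs"])

con.create_function("leg_count", 1, leg_count)
n = con.execute("SELECT leg_count(body) FROM ticket").fetchone()[0]
```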

--
James A. Starkey
President, NimbusDB, Inc.
978 526-1376