firebird-architect - Re: [Firebird-Architect] Re: Well, here we go again

Subject	Re: [Firebird-Architect] Re: Well, here we go again
Author	unordained
Post date	2008-06-20T23:18:53Z

> The main problem with the relational model is that we're at the verge of
> handling complex models described by huge and complex semantic
> definitions and introduction of ontologies into standard data management
> that relational model (even if we would dump SQL with it's closed world
> assumption for good) can't handle sufficiently even with sophisticated
> semantic extensions.

It could be helpful for the discussion to show examples of such situations -- what problems are
hard to solve with the relational model, what data hard to organize?

Jim mentioned inheritance earlier, to which there are several known "solutions":
- one table with fields from all subtypes (messy, even messier with multiple-inheritance;)
- one table per concrete class (messy for selects against the base type -- this is what Postgres
did, hard to enforce constraints as each concrete class has its own PK definition;)
- one table per extension (requires lots of updatable views to be defined, very carefully, and
there's an issue of row identity, particularly for non-exclusive multiple-inheritance, where the
extended table can't share a PK with its parent, because it has several parents and you'd rather
not have to extend all base classes from a single universal base class the way Java does with
Object.)

The point is that if we're to judge models based on how well they support our needs in the real
world, rather than just academically (I believe the relational model pretty well won that battle
already), we'll need examples.

> The second problem of relational model (which is in fact more or less
> common problem of all current data management technologies, Nimbus
> included) is that's designed to work best with "basic" (or at least
> limited set of) data types (that are then formed into relations,
> classes, whatever) we all know and use for ages (although with
> occasional extension or improvement to make it worse), while it's
> starting to be clear that we need a better way to represent/encode
> information in our data stores (hint: unified recursive class/type
> definition and composition).

The relational model does not, itself, require datatypes to be simple. It's mute on the issue. The
choice between a simple tuple containing complex attributes and a complex tuple containing simple
attributes is supposed to be an aesthetic one -- there are guidelines, but no hard rules. (Much
like picking a primary key among available candidate keys -- the relational model doesn't care
there either.) It's all a physical implementation issue, which means there's an opportunity to fix
the problem, somewhere, without scrapping the relational model.

My pet peeve about datatypes: (sorry about the length -- shotgun approach. keep hoping one way or
another of explaining the problem will reach someone.)

The type system we're stuck with currently -- and this is true in the database world and the
procedural world -- is the result of a physical issue, not a logical issue. We simply don't have a
good way to quickly answer the question "is this value usable as an X", or put another way, "does
this value allow me to do Y with it?" The same issue comes up with files, right? Why do we use file
extensions? Because running a query against each file, every time, to determine if it's a valid PDF
document (or could be) would be expensive -- so we precalculate the type. But as far as the PDF
viewer is concerned, all it cares about is "the filed passed must be usable as a PDF" -- meaning it
has to have a specific binary layout. Why do we have integers and characters as datatypes? Because
we need a quick way to determine whether or not the binary data is usable for certain operations,
such as sums. We don't want to have to run a query against the whole table only to discover that
one value wasn't usable in a sum.

There's no rule that says that the only typing systems possible are based on this premise. We
should be able to define logical constraints in a database that say "check(value is integer or
(value is text and value like '%ml'))". Why can't we? Why not be able to store both integers and
text values in the same field, *as is*?

Because we haven't solved the efficiency problem of determining, on the fly, how a value can be
used -- whether or not it meets arbitrary constraints. (If you passed that field to a procedure
whose input parameter had a constraint of "check(value is integer and value > 35)", it would have
to check each one, fail on the text rows, check the value, etc.) A relational database is supposed
to be a logical construct -- but we're still shackled by these physical constraints.

Our solutions have been to tag values as being of a specific type, and as an improvement, of
tagging some specific types as being subclasses of others -- but we still bind values to specific
types, and that screws us over functionally. How do we define "some A's are X's and some B's are
X's"? What is X? It's a constraint -- a boolean function that takes a value and tells you if that
value is in its set, but we're not sure how to fit that into an overall typing system. We keep
making it a filter, a second stage after filtering by specific datatype. First you check that it's
an A, then you see if it's also an X. If you need a procedure to work with X's of either A or B
type, well, you create two functions, do some funky routing, maybe add a superclass over A and B so
you can do some magic, ... it's a mess.

The classic example is this: you have a type of "ellipse", with a binary definition that contains
its axes; you have a procedure that says it can only accept circles -- how do the two interact?
Logically, if the axes of the ellipse happen to be equal, it *is* a circle, yes? It's not a circle
in the binary data-representation sense (the two types would probably be defined very differently
in terms of bytes), but logically, that value is a circle! Maybe it would be acceptable for the
procedure to require that it be given either a circle (in terms of exact datatype) or an ellipse
(again, specific datatype) so long as the ellipse happens to be a circle -- something you can test
on an ellipse without casting.

This isn't a relational problem, it's much larger than that. If you do solve the type issue, make
sure you solve it for procedural languages too, as they're not going away, and people will expect
the database and procedural languages to have equivalent functionality (which is why they gripe
about OO today.)

Note that postgres already provided custom datatypes -- last I checked, the only famous ones are
GIS-related. You don't find a proliferation of new custom datatypes available for download; people
seem, on the whole, to stick to the basics. That might be the result of interoperability issues, I
don't know. Maybe everyone is afraid to use RDBMS-specific features -- at which point this whole
enterprise is lost. (Inertia, as you said.) Maybe it's that any custom datatypes would then need
custom mappings to other custom datatypes, making the whole world of B2B that much more
complicated. (Every business defining custom datatypes internally, without sharing them as part of
any international standards?) Maybe it's the mismatch between database and procedural systems.
Maybe it's just too complex -- remember that most people have trouble even coming up with a
relatively good database design, without worrying about custom types.

-Philip