firebird-architect - Re: [Firebird-Architect] Re: Well, here we go again

Subject	Re: [Firebird-Architect] Re: Well, here we go again
Author	Pavel Cisar
Post date	2008-06-21T17:59:28Z

unordained wrote:

>> The main problem with the relational model is that we're at the
>> verge of handling complex models described by huge and complex
>> semantic definitions and introduction of ontologies into standard
>> data management that relational model (even if we would dump SQL
>> with its closed world assumption for good) can't handle
>> sufficiently even with sophisticated semantic extensions.
>
> It could be helpful for the discussion to show examples of such
> situations -- what problems are hard to solve with the relational
> model, what data hard to organize?

<SNIP>

> The point is that if we're to judge models based on how well they
> support our needs in the real world, rather than just academically (I
> believe the relational model pretty well won that battle already),
> we'll need examples.

It all boils down to semantics. Relational calculus doesn't care about
meaning, period. You can join apples to oranges as long as the joining
attributes happen to be "compatible" (yeah, that ought to be solved to
some degree by a more ingenious type system, but not completely), but
the main problem is that the result has no semantics attached, so it
could be completely bogus and there is no way you can tell. Sure, most
of this is in fact a built-in constraint of SQL (or any other relational
query language), but it's that way because the relational model is based
on relational algebra, which is just that - an algebra, a computational
tool. It's hard to impossible to carry semantics through relational
computation. So all relational query languages basically gave up on this
and left all responsibility to developers. And developers, who live in a
world of CPU cycles, algorithms, registers, numbers, strings, objects
and data structures, generally suck at seeing through the information
representation (data) to the actual information.
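
To make the apples-to-oranges point concrete, here is a minimal Python
sketch of a plain equi-join. The two relations, their attribute names
and the equi_join helper are all made up for illustration; nothing here
comes from a real system. The join succeeds simply because two integer
columns happen to hold equal values.

```python
# Two unrelated relations (hypothetical data) whose columns happen to be
# type-compatible: grams on one side, cents on the other.
apples = [
    {"variety": "Gala", "weight_g": 150},
    {"variety": "Fuji", "weight_g": 200},
]
oranges = [
    {"supplier": "Citrus Co", "price_cents": 150},
    {"supplier": "Sunny Ltd", "price_cents": 90},
]


def equi_join(left, right, left_attr, right_attr):
    """Pair every left row with every right row whose join values are equal."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if l_row[left_attr] == r_row[right_attr]
    ]


# The algebra is satisfied: 150 == 150. Whether "an apple weighing 150 g"
# has anything to do with "a supplier charging 150 cents" is not its concern.
result = equi_join(apples, oranges, "weight_g", "price_cents")
print(result)
```

The result is a perfectly well-formed row, and nothing in it says that
it is nonsense. That knowledge lives only in the developer's head.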

There were many attempts to solve this problem, but all of them failed
miserably, notably because they are/were focused on putting more
semantics into the schema (i.e. permanent storage), with the highest
goal being to enrich constraint definitions and information relation
pathways, but at the output you'll always end up with just a homogeneous
mess of rows and richly defined attributes. The lack of standardization
also plays its role here, as any really good idea that gets implemented
by just one system is not widely used, so we always end up using the
lowest common denominator, which sucks.

The sole advantage of relational systems over other "similar" ones is
that the relational algebra they are based on is simple, complete,
flexible and easy to implement. No wonder that relational systems are so
widely used and popular, and so are files in a file system. Relational
systems are used just as simple ACID data stores with basic enforcement
of data validity, and we have to do everything else in upper layers. And
a great part of what we actually do is just programming the translation
of information encoding between semantic contexts (and no, semantic
extensions or XML are not a solution), and there is no way a relational
system can help you automate this. It wasn't much of a problem till now,
but as our need to work with information from various contexts increases
almost exponentially (the Internet really changed everything), it has
become a real pain.

>> The second problem of relational model (which is in fact more or
>> less common problem of all current data management technologies,
>> Nimbus included) is that's designed to work best with "basic" (or
>> at least limited set of) data types (that are then formed into
>> relations, classes, whatever) we all know and use for ages
>> (although with occasional extension or improvement to make it
>> worse), while it's starting to be clear that we need a better way
>> to represent/encode information in our data stores (hint: unified
>> recursive class/type definition and composition).
>
> The relational model does not, itself, require datatypes to be
> simple. It's mute on the issue. The choice between a simple tuple
> containing complex attributes and a complex tuple containing simple
> attributes is supposed to be an aesthetic one -- there are
> guidelines, but no hard rules. (Much like picking a primary key among
> available candidate keys -- the relational model doesn't care there
> either.) It's all a physical implementation issue, which means
> there's an opportunity to fix the problem, somewhere, without
> scrapping the relational model.

Yeah, it's mute about it because relational algebra basically doesn't
care about them at all. And that's the problem you can't fix, believe
me. At the most basic level, types are just an implementation artifact
we had to use, and by making them more complex you would only complicate
the implementation. In fact, focusing on types will worsen the issue.

A type basically defines what you can and can't do with a value, but
what does that really mean? It's easy to tell for simple types (numbers,
dates and text), but any other type doesn't just extend or modify the
set of valid values and operations and make them more complex, it also
carries semantic meaning (objects are the most profound example). Enter
the dilemma of what should carry the semantics (tables, attributes or
both), how to define the operations, mutual relations, hierarchies etc.,
and you'll quickly find yourself in a maze of thousands of twisted
passages, all the same, where you'll get lost in a second and could walk
in circles for the rest of your life. But don't worry, you'll be in the
good company of many otherwise smart people :-)

<snip>

>
> The classic example is this: you have a type of "ellipse", with a
> binary definition that contains its axes; you have a procedure that
> says it can only accept circles -- how do the two interact?
> Logically, if the axes of the ellipse happen to be equal, it *is* a
> circle, yes? It's not a circle in the binary data-representation
> sense (the two types would probably be defined very differently in
> terms of bytes), but logically, that value is a circle! Maybe it
> would be acceptable for the procedure to require that it be given
> either a circle (in terms of exact datatype) or an ellipse (again,
> specific datatype) so long as the ellipse happens to be a circle --
> something you can test on an ellipse without casting.
>
> This isn't a relational problem, it's much larger than that. If you
> do solve the type issue, make sure you solve it for procedural
> languages too, as they're not going away, and people will expect the
> database and procedural languages to have equivalent functionality
> (which is why they gripe about OO today.)

Have you heard about "duck typing"? Python is a well-known example of a
duck-typed language. It is bound to dynamic languages, though, and hard
to swallow for static typing advocates, but it *is* a solution, and a
good one in my personal experience (although it has its drawbacks in
other areas).
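
As a rough illustration of how duck typing handles the ellipse/circle
question quoted above, here is a small Python sketch. The Circle and
Ellipse classes, the semi_axes property and the circumference function
are hypothetical names invented for this example. The procedure never
asks for the declared type; it only checks that the value it was given
behaves like a circle.

```python
import math
from dataclasses import dataclass


@dataclass
class Circle:
    radius: float

    @property
    def semi_axes(self):
        # A circle is described by one number, exposed here as two equal axes.
        return (self.radius, self.radius)


@dataclass
class Ellipse:
    a: float
    b: float

    @property
    def semi_axes(self):
        return (self.a, self.b)


def circumference(shape):
    """Accept anything that exposes semi_axes and happens to be circular."""
    a, b = shape.semi_axes
    if not math.isclose(a, b):
        raise ValueError("shape is not circular")
    return 2 * math.pi * a


# Differently represented values, same behaviour where it matters.
print(circumference(Circle(1.0)))
print(circumference(Ellipse(1.0, 1.0)))
```

An Ellipse(1.0, 2.0) would be rejected with a ValueError at call time,
which is exactly the part that static typing advocates find hard to
swallow: the check happens on the value, not on the declaration.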

> Note that postgres already provided custom datatypes -- last I
> checked, the only famous ones are GIS-related. You don't find a
> proliferation of new custom datatypes available for download; people
> seem, on the whole, to stick to the basics. That might be the result
> of interoperability issues, I don't know. Maybe everyone is afraid to
> use RDBMS-specific features -- at which point this whole enterprise
> is lost. (Inertia, as you said.) Maybe it's that any custom datatypes
> would then need custom mappings to other custom datatypes, making the
> whole world of B2B that much more complicated. (Every business
> defining custom datatypes internally, without sharing them as part of
> any international standards?) Maybe it's the mismatch between
> database and procedural systems. Maybe it's just too complex --
> remember that most people have trouble even coming up with a
> relatively good database design, without worrying about custom types.

The problem is that custom data types (or any tinkering with data types)
are not a solution. They will only divert you from the right path, as
they are all about information encoding, not the information itself.

To explain why, I have to take a detour to the very beginning, sorry.
The root of all problems as I see it is in how we perceive Information.
For decades, database guys have taught us that "data are not information
and information is not data", and this mantra has been hammered into
everyone's head, so everybody believes in it and takes it as the
starting point for further thinking. But it's *not* true, and until we
can see that, we can't get through our current problems in the
information management area.

Sure, this dichotomy between Data and Information has its merit when
you're looking at Information problems from a specific angle (classic
Information Theory, for example). But modern studies in areas like
biology and physics (notably quantum physics) lead us to a different
perception of what Information really is. If you put aside for a moment
the classic perception that Information is (to put it simply) in the
"eyes of the receiver that interprets the data" and focus on what
Information is/could be as an independent entity, you will eventually
come to the conclusion that Information is in fact a *pattern*.

This pattern is always encoded in some way (information doesn't exist
un-encoded in the known universe), and you can encode any pattern in
many different ways without losing it (so much for "information is not
data and vice versa"). In fact, you can define any pattern using a
mathematical formula, so the "essence" of any information could be best
described/encoded as such a formula. Any pattern forms a system of its
own (i.e. any system is defined by its information pattern). Patterns
interact, constantly, and it happens under rules we call nature's laws.
But we humans consider Information as "real" only when it has some
*value* for us, i.e. we can perceive the pattern and integrate it into
our private pattern (generally speaking, this covers all those real
world situations when you see a car approaching at a dangerous vector
and velocity, or your brain is assembling the dots on the screen into
words that then form the source of your application, or your body is
incorporating proteins from your meal). But this happens at various
levels outside our perception as well (for example, cells exchange
complex protein chains that carry information vital for them to change
their behavior, to cooperate or attack each other etc.).

There are a few things that make it a bit harder for us to understand:

1. we tend to see information as something ethereal,

2. our perception of patterns is tied to our perception of time (which
is a strange beast that doesn't even exist on its own as an independent
entity), so we see them only as snapshots at the moment of sampling, or
as transitions between states (depending on context and sampling
method), and hence time is woven into the pattern itself,

3. and patterns are also recursively constructed in nature, like
fractals are. It all depends on the level at which you can / want to
observe them (i.e. the encoding context), and thanks to how our mind
operates, you can see (and operate with) only the piece of the puzzle
you're actually focused on. A simple example: you can see a book as a
whole, or decompose it into chapters, chapters into articles, articles
into lines, lines into words, words into characters, characters into
curves, curves into dots etc. But this applies to everything: you can
see a pattern in society that could be disassembled into patterns in
small groups, groups into individuals, individuals into organs, organs
into cells, cells into proteins, proteins into atoms, atoms into
particles. You get the point. They all form patterns at each level, and
all patterns are connected at some level (including us, we're part of
the system), and interact.

We have many scientific disciplines that study patterns in various areas
and at various levels, using mathematics as a unified language to
describe these patterns. And it's interesting how these patterns are
basically all the same across those disciplines when they're described
as mathematical formulas, even in disciplines as remote from each other
as sociology and meteorology.

But back to how it relates to our information management issues. It all
stands and falls with the fact that Information exists only in encoded
form, although this encoding could take many different forms. There is
basically no difference between the Information / pattern your DNA
carries encoded using two long polymers on backbones made of sugars and
phosphate groups joined by ester bonds, the same pattern encoded using
sequences of T,A,G,C characters on disk, and that nifty DNA double helix
image on your screen; they're just encoded differently. Yet our whole
information technology is obsessively focused on encoding and related
issues instead of on the patterns themselves. It's no surprise, as we
just blindly do what we're used to: translate patterns from one encoding
context to another, all the time. It's a natural distraction, as our
very functionality as a species is based on converting encoded
information between contexts, so we just transferred that to computers
in the most simplistic way.
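
The "same pattern, different encoding" idea can be sketched in a few
lines of Python. Everything below is an assumption made for
illustration: the helper names, the sample sequence and the 2-bit
packing scheme are invented, not taken from any real genomics format.
One base sequence is stored once as text and once as a packed integer,
and both decode back to the identical pattern.

```python
BASES = "ACGT"  # 4 symbols, so each base fits in 2 bits


def to_text(seq):
    """Encoding 1: one character per base."""
    return "".join(seq)


def from_text(text):
    return tuple(text)


def to_packed(seq):
    """Encoding 2: 2 bits per base in one integer, plus the length."""
    value = 0
    for base in seq:
        value = (value << 2) | BASES.index(base)
    return len(seq), value


def from_packed(packed):
    length, value = packed
    out = []
    for _ in range(length):
        out.append(BASES[value & 3])  # lowest 2 bits hold the last base
        value >>= 2
    return tuple(reversed(out))


pattern = ("G", "A", "T", "T", "A", "C", "A")
as_text = to_text(pattern)
as_bits = to_packed(pattern)

# The two stored forms look nothing alike, yet carry the same pattern.
print(as_text, as_bits)
print(from_text(as_text) == from_packed(as_bits) == pattern)
```

Note how much of the code is about the encodings and how little is
about the pattern itself, which is the complaint in a nutshell.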

Sure, we *have to* operate with encoded information because there is no
other way, and we need to take the information into specific encoding
contexts to work with it effectively (for example, a library is not
organized at the level of individual chapters), but we should store the
information in a way that would simplify these context switches in the
first place. In the early days of computing there was no easy way to do
it, and we had to pay primary attention to encoding details, but it's
the 21st century now, yet we're still lost at the encoding level. We
somehow forgot what the important, primary objective is here. And it
bites us, big time, as the number of contexts we need to operate in
dramatically increases. Until we change our focus back to what's
important, we will find ourselves solving yet another encoding
transcription problem.

One small example for all: the most important feature of today - web
search - doesn't work with the information you're looking for. It simply
looks for character sequences we happen to use to encode the information
that match the pattern you have typed, using the same fuzzy rules of
natural language for encoding. It uses some hellishly sophisticated
algorithms based on complex mathematics to get as close to what you're
looking for as possible, but there are definitely hard limits on how far
you can get that way (and it's already starting not to be enough).
You'll have bad luck even with simple searches for things that can be
described only with commonly used terms, as you'll get the relevant
links buried in a gazillion irrelevant ones. And can you imagine asking
Google about three new products in the same price & functionality
category that could replace the gizmo you're used to, which was just
discontinued by the vendor? Or what the closest post office to your
place is? You can't do that, although the information is certainly
somewhere out there within Google's reach.

Yes, the whole idea of the semantic web is an attempt to solve that, but
this is not only about web search, it's also about B2B, B2C and
information management in general. And we're back to databases, as
they're at the very bottom of our information pyramid. It's all great
and nice that we're trying to solve the Information problem at higher
levels (semantic web, database federation, Linked Data etc.), but it's
an uphill battle when we have to fight with those stupid relational
databases at the bottom. From the semantic angle, they're as stupid and
unhelpful as flat files, because they're just stores for data, not
information stores.

So I'm sorry about the lengthy, almost philosophical and ranting text,
but it's necessary to explain why focusing on data types is not helping
in any way. You will just end up with the old problem of translating one
encoding to another when crossing encoding context boundaries all day
long, as you would just take it to another level of complexity.

And I see you asking the natural question whether I have a solution to
this :-) Well, I wish I had (although I have an idea or two :-), as
this is indeed an answer worthy of a Nobel prize, a princess and the
whole kingdom. As far as my research went, nobody has a complete
solution yet. Many work on it, pursuing different ways, but almost
nobody is trying to actually take it down to the databases themselves
(at least not directly). My personal take on this is based on an
"esperanto" for information encoding, unified recursive definitions of
encoding contexts, and implementation as cloud & grid hybrids on top of
what seems to be basically an evolved network model storage. It's also
clear that it's necessary to accommodate the RDF/OWL world in it, if not
as an integral part then at least as a common external interface. It's a
hard nut to crack for a single person, so I hold no illusions that I
could be the lucky one, but my research into it made me optimistic
enough to believe that it's doable with current technology, and it's
definitely worth the effort as the prize is most charming. At least it's
a great intellectual exercise :-)

best regards
Pavel Cisar
IBPhoenix