Subject: Re: [Firebird-Architect] Re: Special Relativity and the Problem of Database Scalability
Author: Jim Starkey
Dalton Calford wrote:
> I have been watching this thread, and I think the big difficulty happens to
> be with the concept that data changes. Data has a time of relevance, but
> does not change.
> With limited disk space, you consider that any older values have no
> importance, thus they are removed from the database. In accounting, no data
> is ever irrelevant, thus it is never removed. You may archive the data,
> making it harder to query, but, you do not destroy it. You keep track of
> every action upon a tuple by a journal. That journal includes all changes,
> in the order they are recorded, from where that change occurred and who
> performed the change.
Here's the problem, Dalton. Even if the old data were retained, there
still must be a mechanism to get at it. A record that was legitimately
deleted just ain't there no more.

But it is an interesting problem, and one that NimbusDB is going to
address, at least partially. A critical problem -- one that probably
everyone around production databases has experienced -- is what to do
when an authorized user has done a bonehead operation and deleted,
destroyed, or overwritten important data. I remember a support call
from the early Interbase days:

Customer: I just accidentally deleted everything in an important
table. What should I do?

Interbase: Don't worry. Unless you do a commit, it's all still there.

Customer: I did a commit.

Interbase: Don't panic. As long as you don't access the table, the
records haven't been garbage collected and we can get them back.

Customer: I just did a count to see if they really went away.

Oh, well.

Unlike traditional database systems, NimbusDB persists a database by
serializing distributed objects to disk files. Each serialization is a
separate file, so nothing is ever overwritten. By a combination of
logging replication messages and retaining old serializations, it is
(theoretically) possible to reconstruct the state of the database at
any given point in time without shutting down the active database. Ann
calls this time travel. I call it human fault tolerance.
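The append-only idea is easy to sketch. The toy below (my own illustration, not NimbusDB's actual on-disk format) writes every serialization to a fresh file and reconstructs an earlier state by picking the newest file at or before a given version, so a "bonehead" overwrite never destroys the prior state:

```python
import json
import os
import tempfile

class AppendOnlyStore:
    """Toy append-only store: every snapshot is a new file; nothing is overwritten."""

    def __init__(self, directory):
        self.directory = directory
        self.counter = 0  # monotonic version number stands in for a timestamp

    def serialize(self, state):
        # Each serialization is a separate file; old ones are retained.
        self.counter += 1
        path = os.path.join(self.directory, f"v{self.counter:08d}.json")
        with open(path, "w") as f:
            json.dump(state, f)
        return self.counter

    def state_at(self, version):
        # "Time travel": load the newest serialization at or before `version`.
        chosen = None
        for name in sorted(os.listdir(self.directory)):
            if name.endswith(".json") and int(name[1:9]) <= version:
                chosen = name
        if chosen is None:
            return None
        with open(os.path.join(self.directory, chosen)) as f:
            return json.load(f)

with tempfile.TemporaryDirectory() as d:
    store = AppendOnlyStore(d)
    v1 = store.serialize({"accounts": {"dalton": 100}})
    v2 = store.serialize({"accounts": {"dalton": 0}})  # the bonehead operation
    # The old serialization is still on disk, so the pre-mistake state survives:
    recovered = store.state_at(v1)
    current = store.state_at(v2)
    print(recovered)  # {'accounts': {'dalton': 100}}
```

A real system would garbage-collect old serializations eventually; the point is only that recovery is a read, not a restore from backup.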
> Now, the journals are isolated into batches, and when a batch is closed, all
> batches in the same period are then accumulated and the period becomes 'In
> transition'. The period then is reviewed and 'closed', becoming
> historical. Each node of the cloud would be sent that information which
> is considered relevant to the needs of the local entities.
> Now, this structure was developed long before computers or sql were ever
> developed. It was developed for the use of corporations spanning many
> countries with multiple conflicting data needs while being audited by
> different countries with different taxation and recording rules.
> The mechanics of the process are being lost as manual bookkeeping is being
> forgotten, but the problems of conflicting updates were all dealt with
> long ago.
> Now we are once again having data that is migrating from a single set of
> books to a large distributed cloud of data. I do not see why we can not
> step back and use those data rules for all databases, accounting or
> otherwise.
> What it means is that updates would never destroy the older data - and
> queries would require specifying the period of interest.
> In setting up such a system, certain assumptions would need to be made - a
> super-transaction, if you will, that goes beyond the current transaction or
> connection to include the current data connection point, time period and
> data importance/value.
> I hope I am being clear in the concept, as this deserves far more thought
> than a simple email can provide. Perhaps it will just get people
> thinking outside of the box and realizing that many of the problems
> currently being worked on were solved a long time ago.
> On 30 January 2010 10:01, Milan Babuskov <milanb@...> wrote:
>> Jim Starkey wrote:
>>>> E.g. two concurrent transactions, Tx1 registers a deposit of $10 in my
>>>> account, Tx2 a deposit of $20. Tx1 executes on Node A, Tx2 on Node B. Tx1
>>>> commits first. When Tx2 commits, it updates its own record (it is
>>>> oblivious to the other update under rule 2), the database remains
>>>> consistent so the commit proceeds. The Tx1 deposit is lost.
>>> No, that's covered by the rule that a transaction can't update a version
>>> of a record that it didn't / couldn't see. In other words, a classical
>>> Firebird update conflict.
>>> The mechanism is that each record has a deterministic resolution agent.
>> Hello,
>> I apologize in advance if what I'm about to write has already been
>> "invented" and dismissed before...
>> Looking at the discussion you have, it strikes me that people trying to
>> build distributed systems are still in the box of the previous systems
>> they built and are not able to look at things outside of that box.
>> (sorry if my English is a little bit crude).
>> The whole idea of each transaction reading value A and storing value B
>> seems very wrong to me in a disconnected, "cloud" environment. It calls
>> for synchronization and conflict resolution for as long as you have it.
>> The problem is how to solve a conflict that propagates through the
>> network of nodes and surfaces 50 nodes away from the point where it was
>> really created.
>> It would be much better if transactions don't store the "old state" and
>> "new state" but rather just "delta" state. Now, "delta" is not a simple
>> value, because there are many ways to interpret why something had a
>> value of "10" before and has "20" now. This calls for abandoning SQL as
>> inadequate tool and replacing it with something that would do the job
>> properly.
>> This new language could keep the relational concept for data, but should
>> be able to express deltas as functions. A simple example would be
>> "increaseBy" operator. Instead of x = x + y, one would use
>> increaseBy(x,y) in such language. Instead of storing "y" in new record
>> version, one would store "increaseBy(x,y)". Of course, except for basic
>> math, all functions would be user-defined at application level. This
>> means that developers writing applications for such model would need to
>> express each data alteration as a function showing "what" is done, not
>> "how".
>> A potential problem with this approach is that the developer would need to
>> know beforehand what kind of "function" might be possible for certain
>> table columns. This is why I believe those functions should be defined
>> in advance, as part of database metadata - maintained with some kind of
>> DDL statements. I haven't thought about this thoroughly, but functions
>> that express conflict resolution (i.e. what to do when you have to apply
>> multiply() and increase() coming from two different nodes at the same
>> time) could also be stored and applied automatically if they are present.
>> I hope any of this makes some sense to you.
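Milan's delta-as-function idea above can be sketched briefly. In this illustration (the `increaseBy`/`multiplyBy` names and the registry are mine, standing in for his proposed DDL-defined functions), a record version stores an operation rather than old/new values, so a commutative delta like `increaseBy` merges cleanly from two nodes in either arrival order and the lost-update scenario from the Tx1/Tx2 example disappears:

```python
# Hypothetical registry of delta operators, standing in for DDL-defined
# functions maintained as database metadata.
DELTAS = {
    "increaseBy": lambda x, y: x + y,
    "multiplyBy": lambda x, y: x * y,
}

def apply_deltas(initial, journal):
    """Replay a journal of (op_name, argument) deltas against a starting value."""
    value = initial
    for op, arg in journal:
        value = DELTAS[op](value, arg)
    return value

# Two nodes each record a delta against the same starting balance of 100.
tx1 = ("increaseBy", 10)  # committed on Node A
tx2 = ("increaseBy", 20)  # committed on Node B

# increaseBy is commutative, so both merge orders agree -- no lost update.
merged_ab = apply_deltas(100, [tx1, tx2])
merged_ba = apply_deltas(100, [tx2, tx1])
print(merged_ab, merged_ba)  # 130 130
```

Mixing non-commuting deltas (say, `increaseBy` and `multiplyBy`) is exactly where the stored conflict-resolution functions Milan mentions would have to step in, since replay order then changes the result.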

Jim Starkey
NimbusDB, Inc.
978 526-1376
