Subject | Re: [Firebird-Architect] Re: Special Relativity and the Problem of Database Scalability |
---|---|
Author | Dalton Calford |
Post date | 2010-02-01T14:43:44Z |
Hi Jim,
In my industry, we have so many laws regulating how we keep records that we
do not allow deletes or updates to occur at all. The closest we have is the
movement of data from one table/database to another.
Data has a good-til date; that is, when a record is 'updated', a new record
is inserted and the older record has its good-til date marked.
Each record also has its user, timestamp, transaction number, etc. recorded
with it.
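A minimal sketch of the pattern in plain SQL (the table and column names
here are my own invention for illustration, not our actual schema):

    -- Append-only rate table: an 'update' inserts a new row and marks
    -- the good-til date of the old one; nothing is ever deleted.
    CREATE TABLE customer_rate (
        customer_id  INTEGER       NOT NULL,
        rate         DECIMAL(9,4)  NOT NULL,
        good_from    TIMESTAMP     NOT NULL,
        good_til     TIMESTAMP,              -- NULL means current record
        entered_by   VARCHAR(32)   NOT NULL, -- user who made the change
        entered_at   TIMESTAMP     NOT NULL,
        txn_number   BIGINT        NOT NULL  -- transaction that wrote it
    );

    -- Close off the old record by marking its good-til date...
    UPDATE customer_rate
       SET good_til = CURRENT_TIMESTAMP
     WHERE customer_id = 42 AND good_til IS NULL;

    -- ...and insert the replacement; the full history stays queryable.
    INSERT INTO customer_rate
        (customer_id, rate, good_from, entered_by, entered_at, txn_number)
    VALUES
        (42, 0.0450, CURRENT_TIMESTAMP, 'dalton', CURRENT_TIMESTAMP, 1001);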
We can change our billing methods and still tell the system to generate an
invoice as of a specific date, because all the earlier values are in the
database and the procedures understand the 'effective' date period.
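For instance, fetching the rate in effect on a given billing date is just a
filter on the effective period (again using the illustrative table above):

    -- Rate in effect for customer 42 as of the invoice date.
    SELECT rate
      FROM customer_rate
     WHERE customer_id = 42
       AND good_from <= DATE '2009-06-15'
       AND (good_til IS NULL OR good_til > DATE '2009-06-15');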
No deletes; some records are moved to historical periods, some sit on slower
machines/disks because they are rarely accessed, and others are archived,
but nothing is deleted or changed.
As for banks and ATMs: ATM networks go down. Networks split, and
communications between banks have their problems.
In the old days, this sort of thing allowed 'kiting', but modern systems
moderate the risk by giving accounts maximum daily withdrawal limits, flags
for concurrent account access (effectively a UDP packet, as it is fire and
forget), and agreements concerning 'unauthorized loans'.
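A daily withdrawal limit, for example, boils down to a check against the
day's journal before authorizing; a rough sketch, with the table, limit and
amounts invented for illustration:

    -- Authorize only if today's withdrawals plus this one stay under
    -- the account's daily limit (here a flat 1000.00).
    SELECT CASE
             WHEN COALESCE(SUM(amount), 0) + 200.00 <= 1000.00
             THEN 'AUTHORIZE'
             ELSE 'DECLINE'
           END AS decision
      FROM withdrawal_journal
     WHERE account_id = 42
       AND CAST(withdrawn_at AS DATE) = CURRENT_DATE;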
The assumption is that the downtime is within an acceptable time period, and
that insurance/overdraft agreements cover the risk. What is important is
that the records cannot be altered and that a full auditable record is kept.
We have the same issues with phone calls, fraud calls, billing and
collections, and other operational matters - we have what are known as
definitive and non-definitive answers. Certain operations allow for
non-definitive answers (you will often see your credit card statement at
the ATM noting that it may not include the latest transactions) while
others require a lock-out.
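In database terms, a non-definitive answer is a balance computed from
whatever has been posted so far, while a definitive one counts only entries
belonging to closed periods; a sketch, with invented names:

    -- Non-definitive: everything posted so far; may miss in-flight items.
    SELECT SUM(amount) AS provisional_balance
      FROM ledger_entry
     WHERE account_id = 42;

    -- Definitive: only entries from periods that have been closed.
    SELECT SUM(e.amount) AS closed_balance
      FROM ledger_entry e
      JOIN accounting_period p ON p.period_id = e.period_id
     WHERE e.account_id = 42
       AND p.status = 'CLOSED';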
Accounting even has the idea of an extra period to deal with the fact that
bills/invoices/checks may not arrive or get entered until after the month
they apply to has passed and been closed.
The reason you close off a period is to get a 'close enough' value to make
decisions. If, after the fact, you find that certain values were
misrepresented, you deal with that in the next period's statements, with
notes as to the reasons.
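In the ledger, such a correction is simply another entry posted to the open
period, pointing back at the entry it adjusts; the closed period itself is
never touched. A sketch, names again invented:

    -- Post a reversing entry in the current open period; the erroneous
    -- entry in the closed period stays intact for the auditors.
    INSERT INTO ledger_entry
        (account_id, period_id, amount, adjusts_entry_id, note)
    VALUES
        (42, 201002, -150.00, 98765,
         'Reverses entry 98765, misposted in closed period 201001');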
Storage space is cheap; the key is efficient data structures and a clear
understanding of the system's needs.
In cloud systems, you must consider that nodes will appear and disappear
(i.e., you are connected to your local bank via an ATM, but a road repair
has just cut your fibre link), and you must weigh that against the impact
on the end user.
The only difference from the old accounting systems, where reconciliation
of the books would happen weekly or even monthly, is that with computers
and today's networks the reconciliation period can be a matter of minutes.
Again, I hope I am being clear, as this type of discussion really needs to
occur face to face, with hand waving, loud disagreements, white-boards and,
of course, the best tool for intelligent design: beer.
best regards
Dalton
On 31 January 2010 10:04, Jim Starkey <jstarkey@...> wrote:
>
>
> Dalton Calford wrote:
> > I have been watching this thread, and I think the big difficulty happens
> > to be with the concept that data changes. Data has a time of relevance,
> > but does not change.
> >
> > With limited disk space, you consider that any older values have no
> > importance, thus they are removed from the database. In accounting, no
> > data is ever irrelevant, thus it is never removed. You may archive the
> > data, making it harder to query, but you do not destroy it. You keep
> > track of every action upon a tuple by a journal. That journal includes
> > all changes, in the order they are recorded, from where that change
> > occurred and who performed the change.
> >
> Here's the problem, Dalton. Even if the old data was retained, there
> still must be a mechanism to get at it. A record that was legitimately
> deleted just ain't there no more.
>
> But it is an interesting problem, and one that NimbusDB is going to
> address, at least partially. A critical problem -- one that probably
> everyone around production databases has experienced -- is what to do
> when an authorized user has done a bonehead operation and deleted,
> destroyed, or overwritten important data. I remember a support call
> from the early Interbase days:
>
> Customer: I just accidentally deleted everything in an important
> table. What should I do?
>
> Interbase: Don't worry. Unless you do a commit, it's all still there.
>
> Customer: I did a commit.
>
> Interbase: Don't panic. As long as you don't access the table, the
> records haven't been garbage collected and we can get them back.
>
> Customer: I just did a count to see if they really went away.
>
> Oh, well.
>
> Unlike traditional database systems, NimbusDB persists a database by
> serializing distributed objects to disk files. Each serialization is a
> separate file, so nothing is ever overwritten. By a combination of
> logging replication messages and retaining old serializations, it is
> (theoretically) possible to reconstruct the state of the database at
> any given point in time without shutting down the active database. Ann
> calls this time-travel. I call it human fault-tolerant.
>
> > Now, the journals are isolated into batches, and when a batch is closed,
> > all batches in the same period are then accumulated and the period
> > becomes 'In transition'. The period then is reviewed and 'closed',
> > becoming historical. Each node of the cloud would be sent that
> > information which is considered relevant to the needs of the local
> > entities.
> >
> > Now, this structure was developed long before computers or sql were ever
> > developed. It was developed for the use of corporations spanning many
> > countries with multiple conflicting data needs while being audited by
> > different countries with different taxation and recording rules.
> >
> > The mechanics of the process are being lost as manual bookkeeping is
> > being forgotten, but the problems of conflicting updates were all dealt
> > with long ago.
> >
> > Now we once again have data migrating from a single set of books to a
> > large distributed cloud. I do not see why we cannot step back and use
> > those data rules for all databases, accounting or otherwise.
> >
> > What it means is that updates would never destroy the older data - and
> > queries would require specifying the period of interest.
> >
> > In setting up such a system, certain assumptions would need to be made
> > - a super-transaction, if you will, that goes beyond the current
> > transaction or connection to include the current data connection point,
> > time period and data importance/value.
> >
> > I hope I am being clear in the concept, as this deserves far more
> > thought than a simple email can provide. Perhaps it will just get
> > people thinking outside of the box and realizing that many of the
> > problems currently being worked on were solved a long time ago.
> >
> >
> >
> > On 30 January 2010 10:01, Milan Babuskov <milanb@...> wrote:
> >
> >
> >> Jim Starkey wrote:
> >>
> >>>> E.g. two concurrent transactions, Tx1 registers a deposit of $10 in my
> >>>> account, Tx2 a deposit of $20. Tx1 executes on Node A, Tx2 on node B.
> >>>> Tx1 commits first. When Tx2 commits, it updates its own record (it is
> >>>> oblivious to the other update under rule 2), the database remains
> >>>> consistent so the commit proceeds. The Tx1 deposit is lost.
> >>>>
> >>> No, that's covered by the rule that a transaction can't update a
> >>> version of a record that it didn't / couldn't see. In other words, a
> >>> classical Firebird update conflict.
> >>>
> >>> The mechanism is that each record has a deterministic resolution agent.
> >>>
> >> Hello,
> >>
> >> I apologize in advance if what I'm about to write has already been
> >> "invented" and dismissed before...
> >>
> >> Looking at the discussion you have, it strikes me that people trying to
> >> build distributed systems are still in the box of the previous systems
> >> they built and are not able to look at things outside of that box.
> >> (sorry if my English is a little bit crude).
> >>
> >> The whole idea of each transaction reading value A and storing value B
> >> seems very wrong to me in a disconnected, "cloud" environment. It calls
> >> for synchronization and conflict resolution as long as you have it. The
> >> problem is how to solve a conflict that propagates through the network
> >> of nodes and happens 50 nodes away from the point where it was really
> >> created.
> >>
> >> It would be much better if transactions don't store the "old state" and
> >> "new state" but rather just "delta" state. Now, "delta" is not a simple
> >> value, because there are many ways to interpret why something had a
> >> value of "10" before and has "20" now. This calls for abandoning SQL as
> >> an inadequate tool and replacing it with something that would do the
> >> job properly.
> >>
> >> This new language could keep the relational concept for data, but should
> >> be able to express deltas as functions. A simple example would be
> >> "increaseBy" operator. Instead of x = x + y, one would use
> >> increaseBy(x,y) in such language. Instead of storing "y" in new record
> >> version, one would store "increaseBy(x,y)". Of course, except for basic
> >> math, all functions would be user-defined at application level. This
> >> means that developers writing applications for such model would need to
> >> express each data alteration as a function showing "what" is done, not
> >> "how".
> >>
> >> A potential problem with this approach is that the developer would need
> >> to know beforehand what kind of "function" might be possible for
> >> certain table columns. This is why I believe those functions should be
> >> defined in advance, as part of database metadata - maintained with some
> >> kind of DDL statements. I haven't thought about this thoroughly, but
> >> functions that express conflict resolution (i.e. what to do when you
> >> have to apply multiply() and increase() coming from two different nodes
> >> at the same time) could also be stored and applied automatically if
> >> they are present.
> >>
> >> I hope any of this makes some sense to you.
> >>
> >>
>
> --
> Jim Starkey
> NimbusDB, Inc.
> 978 526-1376
>