| Subject | Re: [firebird-support] Cascades on very large databases | 
|---|---|
| Author | unordained | 
| Post date | 2010-07-13T17:16:08Z | 
---------- Original Message -----------
From: "sqlsvr" <sqlsvr@...>
[snipped into pieces]
------- End of Original Message -------
equivalent manually-written triggers, or manually-written statements. Feel free
to test it for yourself: how fast it runs will depend on how many rows you're
updating (in total), cache size, CPU, disk, ... which is not something we can
tell you.
The number of relations only matters for the number of relations affected by
the cascade, and the selectivity of the indices in each. Again, your details
will vary.
By "natural key", I assume you mean non-surrogate, probably non-integer? The
indices work the same way regardless; the same prefix compression takes place,
the same selectivity statistics matter, etc. Only relevant if your natural key
would have significantly poorer index selectivity, or much larger size than
some sort of surrogate. Details?
Are you doing a cascade-delete, or a cascade-update? If it's a cascade-update
on this "natural key", and you want to avoid the cost, then don't use that as
the PK/FK. Use it as a not-null unique field (a candidate key) and use an
unchanging surrogate key for your PK/FK. You can change this value without
cascade-updates, and you can decide when you want to fetch it by using an
explicit join.
Firebird, unlike a lot of databases, makes it fairly easy (or even possible) to
cascade-update a primary key value, but that doesn't mean it's something you
should be doing on a very regular basis. Remember that logically, a cascade-
update says "hey, I got this row's identity wrong, let me fix it". That's not
the kind of thing that should be happening every day. And while a database can
fix those references for you internally, it can't fix any external system to
which you may have announced that PK value -- so to play nice with outside
systems, it's a good thing to avoid when possible.
Such a cascade would only block other queries that attempted to modify the same
records your query was modifying, or attempted to insert records referring to
them. Other queries, especially read-only queries, would be unaffected. This is
a benefit of MVCC, if you're coming from an environment where you're used to
writers blocking readers, or writers taking more locks than they deserve (sql-
server, for example, likes to upgrade to page or table locks when it
shouldn't.) It's also easier to tune than in, say, Oracle, where you'd have to
fiddle with the transaction slot count (INITRANS, ITL) to avoid issues with
unrelated updates blocking each other.
elsewhere. What exactly were you meaning to ask?
I'm feeding the trolls at this point, but ...
If you want a "database" that uses pointers, I recommend you look at any of the
so-called post-relational databases (aka pre-relational, hierarchical, CODASYL,
MUMPS, or network databases.) They use pointers internally. They also require
manual navigation, and are easily corrupted, have more problems with multi-user
modifications, and ... well, there's a reason they're no longer in vogue,
except in specialized circumstances. Maybe your situation warrants such an
approach.
Using data, instead of pointers, isn't just a naive implementation of Codd's
definition of a relational database management system, it's also a performance
enhancer in a lot of cases. You can't efficiently join between indirectly-
related tables using only pointers. You can't efficiently find all records with
a foreign key value between two other values, without referring back to the
parent table -- essentially forcing you to do a join where otherwise you
wouldn't have to. In fact, any operation on a child table using such a field
would require looking up the parent record to extract the field value -- even
if all you want to do is log a field's value during a trigger, etc. You might
be able to store both -- the pointer and the value -- but why? Most of the
time, the index + page cache makes that moot. And the other way around, having
a parent keep a pointer to the first child, etc. so it can even-more-
efficiently find all of its children (not a true relational concept, but I'll
pretend) would have other implementation consequences when dealing with
multiple independent transactions inserting children simultaneously. And in
what order would the child pointers be stored? Do you have to navigate the
whole chain of children (linked list), when searching by FK + other field? That
could be highly inefficient, compared to an indexed lookup (which you could
plan ahead for with a multi-field index!) Pointers also, in general, make it
harder for a DB to do a "re-org", defragmenting the space. They would also make
MVCC somewhat more difficult, as records would be left pointing to record
versions that, when cleaned up by the garbage collector, would then need a
cascade-update of pointer values, which is exactly what you're wanting to avoid.
Summary: there are trade-offs you're not seeing. I'm sorry.
For further discussion on that topic, I recommend the firebird-architect list.
I'm sure the guys (and gals) there would be happy to explain the details to
you. Also, if you do find a true relational database system that uses pointers,
please let us know, that could be interesting.
There are a few cases where Firebird makes pointers available to you -- though
only valid for the duration of the current transaction, rdb$db_key values are
essentially pointers, and can be stored in temp tables, used to lookup records
faster, etc.
-Philip
            From: "sqlsvr" <sqlsvr@...>
[snipped into pieces]
------- End of Original Message -------
> How is firebird cascade performance on a very large database (sayCascades are performed through index lookups. They run at the same speed as the
> 100gb) with lots of relations (especially when using natural keys)?
> I'm assuming it would take hours/days and probably block all other
> queries.
equivalent manually-written triggers, or manually-written statements. Feel free
to test it for yourself: how fast it runs will depend on how many rows you're
updating (in total), cache size, CPU, disk, ... which is not something we can
tell you.
The number of relations only matters for the number of relations affected by
the cascade, and the selectivity of the indices in each. Again, your details
will vary.
By "natural key", I assume you mean non-surrogate, probably non-integer? The
indices work the same way regardless; the same prefix compression takes place,
the same selectivity statistics matter, etc. Only relevant if your natural key
would have significantly poorer index selectivity, or much larger size than
some sort of surrogate. Details?
Are you doing a cascade-delete, or a cascade-update? If it's a cascade-update
on this "natural key", and you want to avoid the cost, then don't use that as
the PK/FK. Use it as a not-null unique field (a candidate key) and use an
unchanging surrogate key for your PK/FK. You can change this value without
cascade-updates, and you can decide when you want to fetch it by using an
explicit join.
Firebird, unlike a lot of databases, makes it fairly easy (or even possible) to
cascade-update a primary key value, but that doesn't mean it's something you
should be doing on a very regular basis. Remember that logically, a cascade-
update says "hey, I got this row's identity wrong, let me fix it". That's not
the kind of thing that should be happening every day. And while a database can
fix those references for you internally, it can't fix any external system to
which you may have announced that PK value -- so to play nice with outside
systems, it's a good thing to avoid when possible.
Such a cascade would only block other queries that attempted to modify the same
records your query was modifying, or attempted to insert records referring to
them. Other queries, especially read-only queries, would be unaffected. This is
a benefit of MVCC, if you're coming from an environment where you're used to
writers blocking readers, or writers taking more locks than they deserve (sql-
server, for example, likes to upgrade to page or table locks when it
shouldn't.) It's also easier to tune than in, say, Oracle, where you'd have to
fiddle with the transaction slot count (INITRANS, ITL) to avoid issues with
unrelated updates blocking each other.
> How would you delete/update a record that would cause lots of cascades?As to how to do this -- easy. Delete some records with lots of dependencies
elsewhere. What exactly were you meaning to ask?
> If firebird used internal pointers for the foreign keys, itAnd again with the pointers. *sigh*. What is it with everyone and the pointers?
> would be instant instead of hours/days.
I'm feeding the trolls at this point, but ...
If you want a "database" that uses pointers, I recommend you look at any of the
so-called post-relational databases (aka pre-relational, hierarchical, CODASYL,
MUMPS, or network databases.) They use pointers internally. They also require
manual navigation, and are easily corrupted, have more problems with multi-user
modifications, and ... well, there's a reason they're no longer in vogue,
except in specialized circumstances. Maybe your situation warrants such an
approach.
Using data, instead of pointers, isn't just a naive implementation of Codd's
definition of a relational database management system, it's also a performance
enhancer in a lot of cases. You can't efficiently join between indirectly-
related tables using only pointers. You can't efficiently find all records with
a foreign key value between two other values, without referring back to the
parent table -- essentially forcing you to do a join where otherwise you
wouldn't have to. In fact, any operation on a child table using such a field
would require looking up the parent record to extract the field value -- even
if all you want to do is log a field's value during a trigger, etc. You might
be able to store both -- the pointer and the value -- but why? Most of the
time, the index + page cache makes that moot. And the other way around, having
a parent keep a pointer to the first child, etc. so it can even-more-
efficiently find all of its children (not a true relational concept, but I'll
pretend) would have other implementation consequences when dealing with
multiple independent transactions inserting children simultaneously. And in
what order would the child pointers be stored? Do you have to navigate the
whole chain of children (linked list), when searching by FK + other field? That
could be highly inefficient, compared to an indexed lookup (which you could
plan ahead for with a multi-field index!) Pointers also, in general, make it
harder for a DB to do a "re-org", defragmenting the space. They would also make
MVCC somewhat more difficult, as records would be left pointing to record
versions that, when cleaned up by the garbage collector, would then need a
cascade-update of pointer values, which is exactly what you're wanting to avoid.
Summary: there are trade-offs you're not seeing. I'm sorry.
For further discussion on that topic, I recommend the firebird-architect list.
I'm sure the guys (and gals) there would be happy to explain the details to
you. Also, if you do find a true relational database system that uses pointers,
please let us know, that could be interesting.
There are a few cases where Firebird makes pointers available to you -- though
only valid for the duration of the current transaction, rdb$db_key values are
essentially pointers, and can be stored in temp tables, used to lookup records
faster, etc.
-Philip