Subject | Re: feature request: materialized views |
---|---|
Author | paulruizendaal |
Post date | 2004-09-21T22:18:01Z |
Dear architects,
My understanding is that Oracle materialised views are a combination
of three different things, that combine to something quite nifty:
- replication
- precalculated aggregates/joins
- query rewrite
In my opinion the real intelligence is in the query rewrite, which is
far removed from the DDL syntax and update policies that people seem
to think about first when discussing materialised views.
From the Oracle data warehousing manual:
<quote>Typically, data flows from one or more online transaction
processing (OLTP) databases into a data warehouse on a monthly,
weekly, or daily basis. The data is normally processed in a staging
file before being added to the data warehouse. Data warehouses
commonly range in size from tens of gigabytes to a few terabytes.
Usually, the vast majority of the data is stored in a few very large
fact tables.
One technique employed in data warehouses to improve performance is
the creation of summaries. Summaries are special kinds of aggregate
views that improve query execution times by precalculating expensive
joins and aggregation operations prior to execution and storing the
results in a table in the database. For example, you can create a
table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in
literature on data warehousing are created in Oracle using a schema
object called a materialized view. Materialized views can perform a
number of roles, such as improving query performance or providing
replicated data.
Prior to Oracle8i, organizations using summaries spent a significant
amount of time and effort creating summaries manually, identifying
which summaries to create, indexing the summaries, updating them, and
advising their users on which ones to use. The introduction of
summary management in Oracle8i eased the workload of the database
administrator and meant the user no longer needed to be aware of the
summaries that had been defined. The database administrator creates
one or more materialized views, which are the equivalent of a
summary. The end user queries the tables and views at the detail data
level. The query rewrite mechanism in the Oracle server automatically
rewrites the SQL query to use the summary tables. This mechanism
reduces response time for returning results from the query.
Materialized views within the data warehouse are transparent to the
end user or to the database application. Although materialized views
are usually accessed through the query rewrite mechanism, an end user
or database application can construct queries that directly access
the summaries. However, serious consideration should be given to
whether users should be allowed to do this because any change to the
summaries will affect the queries that reference them.</quote>
Replication
===========
Replication is, I suppose, the most widely understood of the elements
involved in materialised views. Several Firebird VAR's offer
replication solutions for Firebird and I assume that one way or
another these solutions can be used to mimic the behaviours of Oracle
materialised views on remote tables -- if that is what we want.
Oracle's implementation looks straightforward: for tables that
participate in replication, change logs are kept. The logging is
defined with the CREATE MATERIALISED VIEW LOG ddl syntax.
In order to replicate, Oracle needs a unique id for each replicated
table's rows. It uses either the primary key (if there is one) or the
rowid (similar to our DB_KEY). It has a mechanism to require a full
refresh if the replication id is rowid based and a db backup/restore
has occurred. The log can keep a sequence number in the log "to
facilitate certain types of propagation". This seems to be relevant
to transactions containing multiple, mixed DML statements that must
be replayed in the right sequence.
Precalculated aggregates/joins
==============================
A precalculated materialised view is basically a select result set
that has been stored in a physical table, much like the Microsoft
meaning of SELECT ... INTO. The interesting bit comes with the
refresh of such a table if the original result changes due to a
change in the underlying tables. Two things are relevant:
- how will the updates be propagated
- when will the updates be propagated
Oracle offers two versions for the "how?" part, FAST and COMPLETE.
Complete seems simple, the system empties the materialised view
table, reruns the query and stores the result set. For some sorts of
query this is the only option (e.g. queries using analytical
functions, or using non-repeatable elements, such as SYSDATE).
Fast is clever and uses the replication mechanisms to track the
changes made to the underlying tables and propagate these to the
materialised view ("internal replication", so to speak):
aggregates
----------
<quote>In data warehouses, materialized views normally contain
aggregates. For fast refresh to be possible, the SELECT list must
contain all of the GROUP BY columns (if present), and there must be a
COUNT(*) and a COUNT(column) on any aggregated columns. Also,
materialized view logs must be present on all tables referenced in
the query that defines the materialized view. The valid aggregate
functions are: SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN,
and MAX, and the expression to be aggregated can be any SQL value
expression.</quote>
The restrictions seem logical: one needs COUNT data in order to be
able to propagate detail record updates to AVG, STDDEV, etc. columns.
I'm a bit miffed how detail updates to MIN/MAX columns are done.
joins
-----
<quote>Some materialized views contain only joins and no aggregates.
The advantage of creating this type of materialized view is that
expensive joins will be precalculated.
If you specify REFRESH FAST, Oracle performs further verification of
the query definition to ensure that fast refresh can be performed if
any of the detail tables change. These additional checks are:
- A materialized view log must be present for each detail table.
- The rowids of all the detail tables must appear in the SELECT list
of the materialized view query definition.
- If there are no outer joins, you may have arbitrary selections and
joins in the WHERE clause. However, if there are outer joins, the
WHERE clause cannot have any selections. Further, if there are outer
joins, all the joins must be connected by ANDs and must use the
equality (=) operator.</quote>
The first requirement seems unavoidable. The second seems
implementational convenience: surely including the columns of a
unique key would be sufficient. The third I cannot fathom (yet).
Oracle offers three solutions to the "when?" bit:
- after each change (ON COMMIT)
- scheduled (START WITH .. NEXT ..)
- by invoking a special, built-in PL/SQL procedure (ON DEMAND)
The ON COMMIT option is only practical when combined with FAST
propagation; I believe Oracle blocks the ON COMMIT / COMPLETE combo.
Query rewrite
=============
<quote>The query optimizer automatically recognizes when an existing
materialized view can and should be used to satisfy a request. It
then transparently rewrites the request to use the materialized view.
Queries go directly to the materialized view and not to the
underlying detail tables. In general, rewriting queries to use
materialized views rather than detail tables improves
response.</quote>
According to the figure in the text, the optimiser generates plans
for both direct access and for using appropiate materialised views
and picks the best option.
I would welcome input on the algorithms that such an optimiser might
use. I guess it could work by matching (sub-) RSE trees with all
available materialised views in a combination of hashing ("is it
based on the same base tables?") and brute force ("does it match?").
Note that implementing query rewriting delivers 80% of the
materialised view benefit to the end user, because IMHO doing a
COMPLETE refresh on the summary tables, say every night, will do just
fine in most cases.
Paul
PS For those who wonder: no, I am not implementing this -- it is
*way* down on my "things I would like to do" list
--- In Firebird-Architect@yahoogroups.com, "Alexander Klenin"
<ask@i...> wrote:
My understanding is that Oracle materialised views are a combination
of three different things, that combine to something quite nifty:
- replication
- precalculated aggregates/joins
- query rewrite
In my opinion the real intelligence is in the query rewrite, which is
far removed from the DDL syntax and update policies that people seem
to think about first when discussing materialised views.
From the Oracle data warehousing manual:
<quote>Typically, data flows from one or more online transaction
processing (OLTP) databases into a data warehouse on a monthly,
weekly, or daily basis. The data is normally processed in a staging
file before being added to the data warehouse. Data warehouses
commonly range in size from tens of gigabytes to a few terabytes.
Usually, the vast majority of the data is stored in a few very large
fact tables.
One technique employed in data warehouses to improve performance is
the creation of summaries. Summaries are special kinds of aggregate
views that improve query execution times by precalculating expensive
joins and aggregation operations prior to execution and storing the
results in a table in the database. For example, you can create a
table to contain the sums of sales by region and by product.
The summaries or aggregates that are referred to in this book and in
literature on data warehousing are created in Oracle using a schema
object called a materialized view. Materialized views can perform a
number of roles, such as improving query performance or providing
replicated data.
Prior to Oracle8i, organizations using summaries spent a significant
amount of time and effort creating summaries manually, identifying
which summaries to create, indexing the summaries, updating them, and
advising their users on which ones to use. The introduction of
summary management in Oracle8i eased the workload of the database
administrator and meant the user no longer needed to be aware of the
summaries that had been defined. The database administrator creates
one or more materialized views, which are the equivalent of a
summary. The end user queries the tables and views at the detail data
level. The query rewrite mechanism in the Oracle server automatically
rewrites the SQL query to use the summary tables. This mechanism
reduces response time for returning results from the query.
Materialized views within the data warehouse are transparent to the
end user or to the database application. Although materialized views
are usually accessed through the query rewrite mechanism, an end user
or database application can construct queries that directly access
the summaries. However, serious consideration should be given to
whether users should be allowed to do this because any change to the
summaries will affect the queries that reference them.</quote>
Replication
===========
Replication is, I suppose, the most widely understood of the elements
involved in materialised views. Several Firebird VAR's offer
replication solutions for Firebird and I assume that one way or
another these solutions can be used to mimic the behaviours of Oracle
materialised views on remote tables -- if that is what we want.
Oracle's implementation looks straightforward: for tables that
participate in replication, change logs are kept. The logging is
defined with the CREATE MATERIALISED VIEW LOG ddl syntax.
In order to replicate, Oracle needs a unique id for each replicated
table's rows. It uses either the primary key (if there is one) or the
rowid (similar to our DB_KEY). It has a mechanism to require a full
refresh if the replication id is rowid based and a db backup/restore
has occurred. The log can keep a sequence number in the log "to
facilitate certain types of propagation". This seems to be relevant
to transactions containing multiple, mixed DML statements that must
be replayed in the right sequence.
Precalculated aggregates/joins
==============================
A precalculated materialised view is basically a select result set
that has been stored in a physical table, much like the Microsoft
meaning of SELECT ... INTO. The interesting bit comes with the
refresh of such a table if the original result changes due to a
change in the underlying tables. Two things are relevant:
- how will the updates be propagated
- when will the updates be propagated
Oracle offers two versions for the "how?" part, FAST and COMPLETE.
Complete seems simple, the system empties the materialised view
table, reruns the query and stores the result set. For some sorts of
query this is the only option (e.g. queries using analytical
functions, or using non-repeatable elements, such as SYSDATE).
Fast is clever and uses the replication mechanisms to track the
changes made to the underlying tables and propagate these to the
materialised view ("internal replication", so to speak):
aggregates
----------
<quote>In data warehouses, materialized views normally contain
aggregates. For fast refresh to be possible, the SELECT list must
contain all of the GROUP BY columns (if present), and there must be a
COUNT(*) and a COUNT(column) on any aggregated columns. Also,
materialized view logs must be present on all tables referenced in
the query that defines the materialized view. The valid aggregate
functions are: SUM, COUNT(x), COUNT(*), AVG, VARIANCE, STDDEV, MIN,
and MAX, and the expression to be aggregated can be any SQL value
expression.</quote>
The restrictions seem logical: one needs COUNT data in order to be
able to propagate detail record updates to AVG, STDDEV, etc. columns.
I'm a bit miffed how detail updates to MIN/MAX columns are done.
joins
-----
<quote>Some materialized views contain only joins and no aggregates.
The advantage of creating this type of materialized view is that
expensive joins will be precalculated.
If you specify REFRESH FAST, Oracle performs further verification of
the query definition to ensure that fast refresh can be performed if
any of the detail tables change. These additional checks are:
- A materialized view log must be present for each detail table.
- The rowids of all the detail tables must appear in the SELECT list
of the materialized view query definition.
- If there are no outer joins, you may have arbitrary selections and
joins in the WHERE clause. However, if there are outer joins, the
WHERE clause cannot have any selections. Further, if there are outer
joins, all the joins must be connected by ANDs and must use the
equality (=) operator.</quote>
The first requirement seems unavoidable. The second seems
implementational convenience: surely including the columns of a
unique key would be sufficient. The third I cannot fathom (yet).
Oracle offers three solutions to the "when?" bit:
- after each change (ON COMMIT)
- scheduled (START WITH .. NEXT ..)
- by invoking a special, built-in PL/SQL procedure (ON DEMAND)
The ON COMMIT option is only practical when combined with FAST
propagation; I believe Oracle blocks the ON COMMIT / COMPLETE combo.
Query rewrite
=============
<quote>The query optimizer automatically recognizes when an existing
materialized view can and should be used to satisfy a request. It
then transparently rewrites the request to use the materialized view.
Queries go directly to the materialized view and not to the
underlying detail tables. In general, rewriting queries to use
materialized views rather than detail tables improves
response.</quote>
According to the figure in the text, the optimiser generates plans
for both direct access and for using appropiate materialised views
and picks the best option.
I would welcome input on the algorithms that such an optimiser might
use. I guess it could work by matching (sub-) RSE trees with all
available materialised views in a combination of hashing ("is it
based on the same base tables?") and brute force ("does it match?").
Note that implementing query rewriting delivers 80% of the
materialised view benefit to the end user, because IMHO doing a
COMPLETE refresh on the summary tables, say every night, will do just
fine in most cases.
Paul
PS For those who wonder: no, I am not implementing this -- it is
*way* down on my "things I would like to do" list
--- In Firebird-Architect@yahoogroups.com, "Alexander Klenin"
<ask@i...> wrote:
> >From: "Ann W. Harrison" <aharrison@i...>answer to
> > At 12:28 PM 9/16/2004, Alexander Klenin wrote:
> >>I am talking about SNAPSHOTs aka MATERIALIZED VIEWs. The syntax
> >>(simplified from Oracle):
>
> > Probably this issue should be discussed on the
> >architecture list rather than here.
> Ok.
>
>
> > My short answer is that materialized views are probably not the
> > the question of how to distill data because they affect theglobal name
> > space,
> The issue of namespace is IMO completely orthogonal to the
> concept of cached view. It can be implemented as either
> connection-local-temporary or database-global-permanent,
> or both.
>
> > and because their are either performance burdens
> >to the underlying tables, or horribly inconsistent.
> Well, the fact that Oracle's implementation is perhaps
> suboptimal does not, in itself, mean that Firebird is
> unable to do better ;)
> More to the point, I have used Oracle with some
> respectable-sized databases and complex enough queries,
> and I must say that, when applied correctly, MATERIALIZED
> views can make a difference between an interactive
> analytic application (responce time in seconds) and batch
> report-generation one (response time in hours).
> And all that with just a single word ;)
>
> No doubt, this is not cure-all fit-all solution, but an
> important tool for the right circumstances.
>
> ---
> Professional hosting for everyone - http://www.host.ru