Subject: Re: [firebird-support] Re: Long table or many tables?
Author: Helen Borrie
Post date: 2009-10-09T22:35:41Z
At 06:04 AM 10/10/2009, you wrote:
>Hello Helen,
>
>> In principle, there's nothing to gain by isolating
>> each user's set in its own table.
>
>That's what I want to know. But why is there nothing to gain? I don't doubt that it is true but it would be helpful if I could better understand this. If there are n users and each user has m messages, then surely each operation has to cover n*m rows. Does that make a difference?

No, it doesn't make a difference. A Firebird database isn't a set of flat files. It is a dynamically sized chunk of disk under the control of the engine. Data are laid down on "pages", whose size is determined by the page_size attribute that was set when the database was created (or, optionally, when it was last restored). Once a page has been used for data, it remains a data page for ever.
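For illustration only (the file name and credentials here are placeholders), the page size is fixed at creation time, e.g. in isql:

   CREATE DATABASE 'C:\data\mail.fdb'
     USER 'SYSDBA' PASSWORD 'masterkey'
     PAGE_SIZE 8192;

or at restore time, via gbak's -p[age_size] switch.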
A data page can store only records from one table. How many records appear on a page depends on the size of the records. If you have blobs in the record, they will be stored on the same page as the other data if there is room; otherwise, they will be stored on one or more other pages, called blob pages. Either way, the blob field in the record itself stores a blob_id that the engine uses to find and reference the blob data.
A large record (one that is bigger than the page_size) will be stored across multiple data pages.
There are other types of pages, too, including index pages. All pages have headers that contain pointers to the addresses of other pages the engine has to find in relation to this one. For a data page, this includes the address of the "next" record in the table, which could be anywhere. Pages don't need to be stored in any special numerical order. Eventually, especially if the housekeeping is poor, the pages for a particular table may be distributed all over the disk. This is true whether the table has a lot of rows or a few.
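If you are curious about how a table's pages are actually distributed and filled, the gstat utility will report it (the file and table names here are just examples):

   gstat -d -t MESSAGES mail.fdb

The output shows, per table, the number of data pages and the average fill percentage.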
Firebird uses multi-version architecture (MVA). That means multiple versions of the same record can exist on disk concurrently. The engine will try to write a new record version on the same page as the one containing the original version. If there is not enough space, it will next look for a data page that has been recycled from a garbage cleanup and write the new recversion there. If it can't find that, it gets the operating system to give it a new page_size block of disk.
In a well-tempered database, where garbage collection is occurring regularly, this recycling is efficient. A record version becomes eligible to be garbage when another record becomes the latest committed version AND no other transactions in the global transaction inventory are interested in that record version. If housekeeping isn't happening, or isn't happening often enough, the pages get bogged up with old recversions and it becomes more and more of a job for the engine to find the record you asked for.
Poor attention to GC is by far the most common source of the "slowdown" you described. The fact that the table is "big" doesn't make it slow, per se. A small table will suffer from the same problem if you're not allowing GC to happen. Of course, another side effect of this is that the database file exhibits physical growth that is far in excess of the volume of committed data.
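A quick way to check whether garbage is being allowed to accumulate is to read the transaction counters in the database header (the file name is illustrative):

   gstat -h mail.fdb

If "Oldest transaction" stays frozen while "Next transaction" climbs thousands past it during normal use, old record versions are not being garbage-collected - almost always because some transaction has been left running for hours or days.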
>Now I realise that this project is not necessarily suited to a relational database (and most of our implementations use flat-file systems), but my job in this case is to implement the system for Firebird.

That doesn't make sense to me. The whole idea of putting data under the management of a relational database engine is to abstract the physical data in a way that enables the query engine to find and link it in a zillion different ways. If someone gave you the task of back-ending the system in a Firebird database, it must be because s/he needs that abstraction to make the querying possible.
>> The thing to focus on with regard to your
>> problem is *why* you should be seeing slowdown
>> as data increases.
>
>You are of course correct, but I'm not surprised that operations become slower when the dataset becomes larger.

Be clear what you are talking about. A "table" is just an illusion of stored data that's convenient for people to think about when designing a database or specifying a query (the illusion being that a table is a grid of columns and rows).

A "dataset" is a buzzword injected by a programming tool. It refers (usually) to a set of one or more rows of output from a SELECT query. Depending on which vendor wrote your programming tool, it might be referred to as 'recordset' or something even more bizarre. Some tools might call the same thing a 'table'.
>We probably will get more of an improvement out of improving housekeeping as you suggest.

Let's reiterate: if you are seeing progressive slowdown then "probably" is not in the equation. Automatic housekeeping is being inhibited or blocked entirely by inattention to long-running transactions, and nobody is compensating for these programming flaws by running explicit housekeeping tasks.
>My apologies for not being clear. Each row has a field for message subject, sender, recipient, a folder identifier, date, size, and several others. Each of these is indexed (ascending and descending) because users need to be able to page through the data in order. Not a good example of a relational database.

If you don't normalise the data to eliminate redundancy and capture the relationships between sets of data, then why bother using an RDBMS at all? Just use spreadsheets.
The concept of "users paging through the data in order" is right for spreadsheets and totally wrong for database applications. The point of using an RDBMS is that you can show users *only* the data they want, without making them "page through" lots of data they don't want. Your application gives the user "drill-down" mechanisms to specify (or select) only the data they want right now.
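As a sketch (the table and column names are invented for illustration), a drill-down query fetches just one user's slice of one folder, one screenful at a time:

   SELECT m.msg_date, m.sender, m.subject
     FROM messages m
    WHERE m.userid = ? AND m.folderid = ?
    ORDER BY m.msg_date DESC
     ROWS 1 TO 50;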
>Each index is a combination of the user id and the target field. So, for example, the subject index is userid+subject, because each user only has access to their own data so all SELECT operations have a WHERE clause that includes the userid.

That approach is wasteful and largely useless with Firebird. The userid should be the primary key of a structure that identifies the user uniquely. Then, more dynamic tables such as the "messages" table you discuss here would be linked to that table via a foreign key on userid that references the PK of your "users" table. If one "user" has only one "folder", then that data should be in the "users" table and nowhere else. If one "user" could have more than one "folder", then there should be two other tables: "folders", defining each possible folder across the entire system; and a "user_folder" table with its own unique PK, containing userid and folderid columns that are foreign keys to the respective PKs of "users" and "folders". The messages table itself would store foreign keys that refer to the actual data in the "master" tables.
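A minimal sketch of that structure, with names and types that are purely illustrative, not a prescription:

   -- one row per user; userid is what everything else references
   CREATE TABLE users (
     userid   BIGINT NOT NULL PRIMARY KEY,
     username VARCHAR(60) NOT NULL
   );

   -- every folder that can exist, defined once for the whole system
   CREATE TABLE folders (
     folderid   BIGINT NOT NULL PRIMARY KEY,
     foldername VARCHAR(60) NOT NULL
   );

   -- link table for the case where one user can have many folders
   CREATE TABLE user_folder (
     user_folderid BIGINT NOT NULL PRIMARY KEY,
     userid        BIGINT NOT NULL REFERENCES users (userid),
     folderid      BIGINT NOT NULL REFERENCES folders (folderid)
   );

   -- one messages table serves every user, filtered by userid
   CREATE TABLE messages (
     messageid BIGINT NOT NULL PRIMARY KEY,
     userid    BIGINT NOT NULL REFERENCES users (userid),
     folderid  BIGINT NOT NULL REFERENCES folders (folderid),
     msg_date  TIMESTAMP,
     subject   VARCHAR(200),
     body      BLOB SUB_TYPE TEXT
   );

Note that there is no per-user table anywhere in this: one messages table carries everyone's mail, distinguished by the userid foreign key.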
This design process is called "normalisation". If the "master" data changes somewhere in the chain, it is changed in one and only one place. That capability is the reason you have been asked to abstract your spreadsheet-style data into structures that facilitate the output of "sets" in an infinite number of possible combinations that might or might not be known right now.
Apart from the PK and FK constraints, which have mandatory indexes that are created automatically, you will be interested in indexing columns that are going to be involved in search specifications, most notably joins and WHERE clauses. There is rarely any need to define multi-column indexes: the optimiser has no use for them so they just contribute to background clutter.
There is often a need to define *single*-column indexes, because the optimiser can use them wherever they would genuinely help a search. But some single-column indexes are worse than useless - typically those on columns with very few distinct values - because the optimiser might select one just because it is there, even though, without it, it could choose a more efficient way to perform the search.
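For example (the index and column names are illustrative), ascending and descending indexes are declared separately in Firebird:

   CREATE INDEX idx_msg_date ON messages (msg_date);
   CREATE DESCENDING INDEX idx_msg_date_desc ON messages (msg_date);

The descending one is only worth having if you actually sort or search on msg_date in descending order.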
>But getting back to my question. Given that every conceivable operation will be a subset of the data specific to a user, with users (and all their messages) being added and deleted from time to time: what are the technical problems with creating one table per user?

Intense, unnecessary proliferation of the same abstractions, besides the extra overhead of having all those unnecessary tables under management simultaneously. IOW, it is a source of extra overhead that you can totally avoid by improving the design.
The other factor - the progressive slowdown - has to be addressed in the applications, specifically, in the way they manage transactions. If that is beyond them for now, then **all of you** need to get acquainted with the housekeeping tools (gfix for sweeps, gbak for backups that also clean up garbage). Hmmm, that said, you need to be acquainted with the housekeeping tools anyway.
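By way of example (file names and credentials are placeholders):

   gfix -sweep -user SYSDBA -password masterkey mail.fdb
   gbak -b -user SYSDBA -password masterkey mail.fdb mail.fbk

A sweep clears out garbage that routine cooperative GC cannot reach, and a plain gbak backup garbage-collects record versions as it reads them (unless you disable that with the -g switch).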
>Is it that handling many tables imposes a performance penalty?

Yes.

>Is it that the data is effectively stored the same way anyway?

Yes.

>Or perhaps the cache system is less efficient with this design?

Yes.
>Any of this information would help me in the early stages of planning a possible replacement design in the future.

Actually, all the stuff I've mentioned here is in the book. Read it up, and don't overlook the chapter about indexes and the topic about optimisation.
>
>Thanks for your help.
>
>Also, I have a copy of your book and it has been a valuable help and I think an important contribution to the community.
I'm glad it works that way for you. The book is about Firebird, specifically, of course, and *that* alone made it the 2 kg doorstopper that it is. Although I hope it makes the reader aware that relational database design is important, it is not a comprehensive primer for relational database design. The books listed in Appendix XII will help you a lot in grokking the abstractions. The Simsion book is particularly useful and approachable. If you have trouble finding a copy, contact me privately and I'll try to put you in touch with the author, who has copies for sale.
./heLen