Subject RE: [firebird-support] Database design comments
Author Svein Erling Tysvær
Why create such a huge table? My guess is that of those 125 billion comparisons, the vast majority will never be interesting.

I regularly work with tables of one or two million rows (the biggest being 15 million rows) that I join together. Normally, I've no problem joining three such tables together, but that is partially because the fields I join on are very selective and because I want a reasonably small result set (OK, it does happen that I want one million rows, but normally I'm content with a few thousand).

Occasionally I do something that I think can be compared to what you want to do - I have a set of persons that are cases, and I have to find another set of persons with similar characteristics (matching on gender and different dates) that can serve as controls (normally choosing randomly amongst those that fulfil my criteria, but if you want all that match, it just makes the procedure simpler). Admittedly, the main tables I normally use for such matching only contain 100,000-300,000 records.

Whenever I do such matching, I just use the tables as they are (with decent indexing, of course) and only extract the records that I'm actually interested in. Partly due to the 'choosing randomly' part, and partly because I don't want to slow down the system too much for other users, I go through each person separately, and each person may in some cases take a few seconds. I would have avoided a 'prematched' table containing lots of combinations, if possible. I'd say similarity is something best handled through the use of indexes... I simply doubt that anyone would ever want to know that 'these two brushes have a similarity of 75' when there are several hundred (thousand) brushes that are a lot more alike. And if one of the brushes changed in a very small way, you might have to update thousands of records.

Hence, I'd recommend doing such matching only when required. The actual matching will take some time; depending on your situation it could be a few milliseconds or several hours. The huge table you envision is something I would only consider after finding (and verifying that it wasn't due to a stupid mistake of mine) that matching on request was actually too slow.
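To make that concrete, here is a minimal sketch of matching on request for a single brush. BRUSH, GRADE, DIAMETER and CONTACT_AREA are made-up placeholders for your real table and criteria fields, and the "within 5" test is just an example of a "similar enough" condition:

    SELECT b2.BRUSH_ID
    FROM BRUSH b1
    JOIN BRUSH b2
      ON b2.GRADE = b1.GRADE          -- fields that always have to match
     AND b2.DIAMETER = b1.DIAMETER
     AND b2.BRUSH_ID <> b1.BRUSH_ID
    WHERE b1.BRUSH_ID = 12345         -- the one brush you are comparing right now
      AND b2.CONTACT_AREA BETWEEN b1.CONTACT_AREA - 5
                              AND b1.CONTACT_AREA + 5

Run like this, the query only reads the rows that share the always-matching fields with the brush in question, instead of touching all 500,000 brushes.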

You say that you have some criteria that always have to match. Creating an index that contains these fields may prove very useful (depending on the selectivity of the fields; an evenly distributed 'Boolean' field isn't a good candidate for any index).
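As a sketch (again with placeholder column names), such an index could simply be:

    CREATE INDEX IDX_BRUSH_MATCH ON BRUSH (GRADE, DIAMETER, CONTACT_AREA);

With an index like this, the query above only has to visit the brushes that share the same leading field values, and how much it helps depends entirely on how selective those fields are.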

HTH,
Set

-----Original Message-----
From: firebird-support@yahoogroups.com [mailto:firebird-support@yahoogroups.com] On Behalf Of Luis Carlos Junges
Sent: 12 June 2008 02:00
To: firebird-support@yahoogroups.com
Subject: [firebird-support] Database design comments

hi,

I would like to receive some comments on the database design for this project. Here is what I have:


I am building a system that will be able to compare electrical brushes (graphite electrical brushes).
Basically the comparison will give the level of similarity and also highlight the differences between two brushes.

The main problem is the quantity of brushes; I am talking about 500,000 brushes. So I have to make 125 billion comparisons (a combinatorial count, C(500000, 2) = 500000 * 499999 / 2 = 124,999,750,000 pairs)
in order to know the level of similarity between brushes. Doing those calculations in real time is impossible because it is too time-consuming. My idea, in this
case, is to make those comparisons beforehand and store them in a Firebird database. Well, now the problems begin:

* I will store something like 100 billion entries (some comparisons do not need to be stored - similarity equal to zero, for example).
* I have to store a reference to the two brushes and also the level of similarity. I created a table like this:

| brush1 (integer) | brush2 (integer) | similarity (smallint) |
|------------------|------------------|-----------------------|
| 500000           | 34999            | 98                    |
| 365              | 34999            | 99                    |
| 1000             | 349              | 56                    |
| 1000             | 365              | 75                    |


So the size in bytes of each tuple will be, according to the description on the Firebird website, 10 bytes.
In fact, I only need 7 bytes, but Firebird has no 1-byte integer type (for similarity) and no 3-byte integer type (for brush1 and brush2).
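In DDL terms, the table is roughly this (the table and column names are arbitrary):

    CREATE TABLE BRUSH_SIMILARITY (
        BRUSH1     INTEGER  NOT NULL,  -- reference to the first brush
        BRUSH2     INTEGER  NOT NULL,  -- reference to the second brush
        SIMILARITY SMALLINT NOT NULL   -- level of similarity, e.g. 0-100
    );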

So, taking into account the size of each tuple and the number of entries, the database size will be about 935 gibibytes (almost 1 tebibyte). Correct me if I am wrong,
but I think Firebird creates hash tables or something similar to optimize searching inside the database, so the database size will be higher than 935 GiB. Probably I will beat the 980 GB database mentioned on the Firebird website.
I created one Firebird database containing only this table with 1 million entries; the calculated size is 9.54 MiB but the real size is 70 MiB.
So I think this is a linear increase of 13.35%. Am I right? Is this linear? What would the size be for 100 billion entries? Assuming a linear
increase, the real database size would be 1.06 TiB.
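For reference, the raw-data part of that estimate works out as:

    100,000,000,000 entries * 10 bytes = 1,000,000,000,000 bytes
    1,000,000,000,000 bytes / 1024^3   = roughly 931 GiB

The on-disk overhead (page headers, indexes and so on) then comes on top of that.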

* The Windows system does not accept file sizes higher than 32 GB. Linux systems accept over 64 TB. So, which one is better: multiple database files on Windows,
or a single file on Unix-like systems? I will choose whichever is faster at retrieving data with SELECT. Keep in mind that if I use multiple files, I will have to search in multiple files too.
I do not know if Firebird has a tool to manage multiple files and keep them synchronized, and I also have no idea how fast it is.
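From what I have read, a multi-file database would be defined with something like the following (file names and page numbers are just placeholders), but I do not know how well it performs:

    CREATE DATABASE 'brushes_1.fdb'
      FILE 'brushes_2.fdb' STARTING AT PAGE 2000000
      FILE 'brushes_3.fdb' STARTING AT PAGE 4000000;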

* Another concern I have is how many threads I can use to populate the database initially. I have no idea how to calculate the ideal number of threads that can
run together while the database is being created. If I use a single thread it takes too much time; if I use several thousand threads (5000, for example), the Firebird
engine will probably not be able to process them because a lot of collisions will happen.

---
Luís Carlos Dill Junges ©