Subject Database design comments
Author Luis Carlos Junges
hi,

I would like to receive some comments about this database design for this project. Here is what i have:


I am building a system that will be able to compare electrical brushes (graphite electrical brush).
Basically the comparison will give the level of similarity and also highlight the differences between two brushes.

The main problem is the quantity of brushes; i am talking about 500000 brushes. So, i have to make 125 billions of comparisons (combinatory analysis C500000,2)
in order to know the level of similarity among brushes. Doing those calculus in real time is impossible because is to much time consuming. My idea, in this
case, is to make those comparisons previously and store them in Firebird database. Well, now the problem begins:

* I will stored something like 100 billions entries (some comparisons does not need to be stored - similarity equal to zero for example)
* I have to store a reference for the two brushes and also the level of similarity. I create a table like this

|brush1(integer) | brush2(integer) | similarity (smallint) |
|-----------------------------------------------------------------------------------|
| 500000 | 34999 | 98 |
| 365 | 34999 | 99 |
| 1000 | 349 | 56 |
| 1000 | 365 | 75 |


so the size in bytes for each tuple will be, according to description in firebird website, 10 bytes.
In fact, i just need 7 bytes but i do not have an integer with 1 byte(for similarity) type and an integer with 3 bytes(for brush1 and brush2) in firebird.

So, taking in account the size of each tuple and the amount of entries, the database size will be 935 GigaiBytes (almost 1 TeraiByte). Correct me if i am wrong,
but i think firebird creates hash tables or something similar to optimize the
search inside the database; so the database size will be higher than 935 GigaiBytes. Probably i will beat the actual 980 GB database written in firebird web site.
I created one firebird database with only this table inside and with 1 million entries; the size is by calculus is 9.54 MegaiBytes but the real size is 70 MegaIbytes.
So, i think this is a linear increasing of 13.35%. Am i right? This is linear? How would be the size for 100 Billions entries? Doing according to linear
increase, the real database size would be 1.06 TeraiBytes.

* Windows system does not accept file size higher than 32 GB. Linux systems accept over 64 TB. So, which one is better? Use multiple database files in Windows
or a unique file in Unix like systems? I will choose that one that is more fast to retrieve data using SELECT. Take in mind if i use multiple files i will have to search in multiples files too.
I do not if firebird has tool to manage multiple files and keep multiple database syncronized and also i have no idea how fast it is.

* Another concern that i have is how many threads can i have to create the first database? I have no idea how to calculate the ideal number of threads that can
be running together when the database is being created. If i put a single is too much time, if i put several threads (5000 for example), probably the firebird
engine will not be able to process them because a lot of collisions will happen.







---
Luís Carlos Dill Junges ©

"A realidade de cada lugar e de cada época é uma alucinação coletiva."
Bloom, Howard










---------------------------------
Abra sua conta no Yahoo! Mail, o único sem limite de espaço para armazenamento!

[Non-text portions of this message have been removed]