Subject: RE: [firebird-support] Database design comments
Author: Luis Carlos Junges
Hey Svein,

So, the 125 billion comparisons WILL BE MADE, because right now I have no idea how similar the brushes are. However, the database will not have 125 billion entries. As you said, hardly anyone would ever want to know about brushes with a similarity lower than 90%. So, selecting only those with a similarity higher than 90%, I have no good estimate yet, but in the worst case the table will still have something like 40 billion entries (I think). Either way, that is still far more than 1 billion, which is why I am posting here: to get more ideas about this design and how to change or improve it. I do not like the idea of having more than 1 billion entries for this.
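Just to double-check the pair count, a quick sketch in Python (it simply reproduces the C(500000, 2) figure):

```python
from math import comb

# Unordered pairs among 500,000 brushes: C(500000, 2).
pairs = comb(500_000, 2)
print(f"{pairs:,}")  # 124,999,750,000 -- the "125 billion" figure
```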

Doing it in real time, as you suggest, is I think not possible, because 200 clients will be accessing the records at the same time. Doing the calculation for each one will take more than 3 seconds, and they do not want to wait more than 3 seconds for the results. Keep in mind that this is a system that must be able to identify competitor brushes and my company's brushes quickly, while the client is still on the phone. That is the objective of the project. So: 200 clients querying the system, and each one needs a result in less than 3 seconds!
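To make the load concrete, a back-of-envelope sketch using the numbers above (the 500,000-candidates-per-query figure is a worst case that assumes no pre-filtering at all; a selective index would shrink it enormously):

```python
# Back-of-envelope load estimate for the real-time option.
# All figures are assumptions taken from this thread, not measurements.
CLIENTS = 200           # concurrent users
BUDGET_S = 3.0          # maximum acceptable response time, in seconds
CANDIDATES = 500_000    # brushes one query is compared against (worst case)

# Comparisons per second the server must sustain if every client
# fires a query at the same moment:
rate = CLIENTS * CANDIDATES / BUDGET_S
print(f"{rate:,.0f} comparisons/second")  # ~33 million/s
```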

So, a prematched table (billions of entries) would solve my 3-second problem. If a competitor brush is similar to one of ours, the result will appear in less than 3 seconds. But I am not quite sure this is the best solution.

The system has search criteria that are filled in by the user. With these criteria I find the candidate brush indexes in the brush database. After finding those indexes, I search for similarities in my database.

Well, in fact I have not yet tried it in real time to see how long it takes, because right now I am working on the GUI and am not done with it yet. If the time turns out to be reasonably quick, that would be excellent. It sounds stupid to create a table with billions of entries, but I am trying to find the best solution to increase the speed.

Svein Erling Tysvær <svein.erling.tysvaer@...> wrote: Why create such a huge table? My guess is that, of those 125 billion comparisons, the vast majority will never be interesting.

I regularly work with tables of one or two million rows (the biggest being 15 million rows) that I join together. Normally, I've no problem joining three such tables together, but that is partially because the fields I join on are very selective and because I want a reasonably small result set (OK, it does happen that I want one million rows, but normally I'm content with a few thousand).

Occasionally I do something that I think can be compared to what you want to do: I have a set of persons that are cases and need to find another set of persons with similar characteristics (matching on gender and various dates) that can serve as controls (normally chosen randomly amongst those that fulfil my criteria, but if you want all that match, it just makes the procedure simpler). Admittedly, the main tables I normally use for such matching only contain 100,000-300,000 records.

Whenever I do such matching, I just use the tables as they are (with decent indexing, of course) and only extract the records that I'm actually interested in. Partly due to the 'choosing randomly' part, and partly to avoid slowing down the system too much for other users, I go through each person separately, and each person may in some cases take a few seconds. I would have avoided a 'prematched' table containing lots of combinations, if possible. I'd say similarity is something best handled through the use of indexes... I simply doubt that anyone would ever want to know that 'these two brushes have a similarity of 75' when there are several hundred (thousand) brushes that are a lot more alike. And if one of the brushes changed in a very small way, you might have to update thousands of records.

Hence, I'd recommend doing such matching only when required. The actual matching will take some time; depending on your situation it could be a few milliseconds or several hours. The huge table you envision is something I would only consider after having found (and verified that it wasn't due to a stupid mistake of mine) that matching upon request actually was too slow.

You say that you have some criteria that always have to match. Creating an index that contains these fields may prove very useful (depending on the selectivity of the fields; an evenly distributed 'Boolean' field isn't a good candidate for any index).


-----Original Message-----
From: [] On Behalf Of Luis Carlos Junges
Sent: 12. juni 2008 02:00
Subject: [firebird-support] Database design comments


I would like to receive some comments on the database design for this project. Here is what I have:

I am building a system that will be able to compare electrical brushes (graphite electrical brushes).
Basically, the comparison will give the level of similarity and also highlight the differences between two brushes.

The main problem is the quantity of brushes; I am talking about 500,000 brushes. So I have to make 125 billion comparisons (combinatorial analysis: C(500000, 2))
in order to know the level of similarity among the brushes. Doing those calculations in real time is impossible because it is too time-consuming. My idea, in this
case, is to make those comparisons in advance and store them in a Firebird database. Well, now the problem begins:

* I will store something like 100 billion entries (some comparisons do not need to be stored: those with similarity equal to zero, for example)
* I have to store a reference to the two brushes and also the level of similarity. I created a table like this:

|brush1(integer) | brush2(integer) | similarity (smallint) |
| 500000 | 34999 | 98 |
| 365 | 34999 | 99 |
| 1000 | 349 | 56 |
| 1000 | 365 | 75 |

so the size in bytes of each tuple will be, according to the description on the Firebird website, 10 bytes.
In fact, I only need 7 bytes, but Firebird has no 1-byte integer type (for similarity) and no 3-byte integer type (for brush1 and brush2).

So, taking into account the size of each tuple and the number of entries, the raw data size will be 931 GiB (almost 1 TiB). Correct me if I am wrong,
but I think Firebird creates hash tables or something similar to optimize the
search inside the database, so the real database size will be higher than 931 GiB. I will probably beat the current 980 GB database mentioned on the Firebird website.
I created one Firebird database containing only this table, with 1 million entries; the calculated size is 9.54 MiB, but the real size is 70 MiB, about 7.3 times the raw data.
Is this overhead linear? What would the size be for 100 billion entries? Scaling linearly, the real
database size would be roughly 6.7 TiB.
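The size arithmetic above, as a sketch (the overhead factor comes from my 1-million-row test; assuming it scales linearly, which is not guaranteed):

```python
# Reproduce the size estimates above. All figures are back-of-envelope.
ROW_BYTES = 10                    # 2 x INTEGER (4 B) + 1 x SMALLINT (2 B)
ENTRIES = 100_000_000_000         # ~100 billion stored comparisons

raw = ROW_BYTES * ENTRIES
print(raw / 2**30, "GiB raw")     # ~931 GiB of bare row data

# 1-million-row test database: calculated 9.54 MiB, measured 70 MiB on disk.
overhead = 70.0 / (1_000_000 * ROW_BYTES / 2**20)   # ~7.3x
print(raw * overhead / 2**40, "TiB")                # ~6.7 TiB if linear
```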

* Windows does not accept file sizes above 32 GB, while Linux systems accept over 64 TB. So which is better: multiple database files on Windows,
or a single file on a Unix-like system? I will choose whichever retrieves data faster with SELECT. Keep in mind that if I use multiple files, I will have to search across multiple files too.
I do not know whether Firebird has tools to manage multiple files and keep them synchronized, and I have no idea how fast that is.

* Another concern is how many threads I can use to create the initial database. I have no idea how to calculate the ideal number of threads that can
run together while the database is being populated. With a single thread it takes too much time; with many threads (5000, for example), the Firebird
engine will probably not be able to process them, because a lot of collisions will happen.

Luís Carlos Dill Junges ©


"The reality of every place and every era is a collective hallucination."
Bloom, Howard

