Subject: Re: [firebird-support] Database design comments
Author: Svein Erling Tysvaer
One thing you could do is to have a couple of 'score' fields in your
original table. Say there were four potential values for each of the
three fields that had to match. Then, in a score field, you could use
0, 1, 2, 3 to denote the values of the first field, 0, 4, 8, 12 to
denote the values of the second field, and 0, 16, 32, 64 to denote the
values of the third field. The score field would contain the sum of
these for each record, and you could then always use a single equality
comparison to find relevant records. I would update this score field
using BEFORE INSERT/UPDATE triggers and not even tell the users that
such a field exists.
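The packed-score idea can be sketched as follows (Python, illustration only; the weights 0, 16, 32, 48 are used here for the third field, a uniform base-4 packing that keeps every sum unique, just as the 0/16/32/64 variant above does):

```python
def pack_score(f1, f2, f3):
    """Combine three 4-valued fields (each 0..3) into one score.

    Base-4 packing: weight 1 for the first field, 4 for the second,
    16 for the third, so every combination maps to a distinct score
    and a single equality test replaces three comparisons.
    """
    assert all(0 <= f <= 3 for f in (f1, f2, f3))
    return f1 + 4 * f2 + 16 * f3

def unpack_score(score):
    """Recover the three field values from a packed score."""
    return score % 4, (score // 4) % 4, score // 16

# All 64 combinations produce 64 distinct scores, so a query such as
# WHERE score = :wanted finds exactly the matching records.
```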

For fields that may or may not match, and where you want to find
similar records, you may have to add several such fields with various
combinations, and probably use OR a fair bit in your queries (though
always ANDed with the required matches above).
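For the partial-match case, the OR combinations amount to enumerating every packed score that agrees on the required fields. A hypothetical sketch (Python; `scores_matching` and the base-4 weights 1/4/16 are illustrative, not part of the original scheme):

```python
def scores_matching(f1=None, f2=None, f3=None):
    """All packed scores consistent with the given field constraints.

    None means 'any value' for that field; each field takes values
    0..3 and the packed score is f1 + 4*f2 + 16*f3.
    """
    return [a + 4 * b + 16 * c
            for c in ([f3] if f3 is not None else range(4))
            for b in ([f2] if f2 is not None else range(4))
            for a in ([f1] if f1 is not None else range(4))]

# Fully constrained: one score, one equality comparison.
# One field left free: four scores, i.e. score IN (...) or four ORs.
```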

Your situation is very different from mine; I don't have 200 clients
trying to run my queries. Still, I think I would rather add a few such
scoring fields (even if it ended up being 100 fields) than have
billions of records in a table storing possible combinations for a
table that has a mere 500000 records.


Luis Carlos Junges wrote:
> Hey Svein,
>
> So, the 125 billion comparisons WILL BE MADE, because I have no idea
> right now how similar the brushes are. However, the database will not
> have 125 billion entries. Like you said, hardly anyone would ever want
> to know about brushes with similarity lower than 90%. Selecting only
> those with similarity higher than 90%, I have no good guess now, but
> in the worst case it will be something like 40 billion entries (I
> think). Whatever the exact number, it is still more than 1 billion,
> which is the reason I am posting here: to get more ideas about this
> design and how to change or improve it. I do not like the idea of
> having more than 1 billion entries for this.
>
> Doing it in real time like you said is, I think, not possible, because
> 200 clients will be accessing the records at the same time, and doing
> the calculation for each one will take more than 3 seconds. They do
> not want to wait more than 3 seconds to see the results. Keep in mind
> that this is a system for identifying competitor brushes and my
> company's brushes quickly, while the client is still on the phone.
> That is the objective of the project. So: 200 clients asking the
> system for results, and each one needs a result in less than 3
> seconds!
>
> So, a prematched table (billions of entries) would solve my 3-second
> problem: if there is a competitor brush similar to one of ours, the
> result will appear in less than 3 seconds. I am just not quite sure
> this is the best solution.
>
> The system has search criteria that are filled in by the user. With
> these criteria I find the candidate brush indexes in the brush
> database. After finding those indexes, I look up similarities in my
> database.
>
> In fact, I have not tried it in real time to see how much time it
> takes, because right now I am working on the GUI; I am not ready for
> that yet. If the time turns out to be reasonably quick, that would be
> excellent. It sounds stupid to create a table with billions of
> entries, but I am trying to think of the best solution to increase the
> speed.
>
> Svein Erling Tysvær <svein.erling.tysvaer@...> wrote:
>
> Why create that huge a table? My guess is that out of those 125
> billion comparisons, the vast majority will never be interesting.
>
> I regularly work with tables of one or two million rows (the biggest
> being 15 million rows) that I join together. Normally I have no
> problem joining three such tables, but that is partly because the
> fields I join on are very selective and because I want a reasonably
> small result set (OK, it does happen that I want one million rows, but
> normally I'm content with a few thousand).
>
> Occasionally I do something that I think can be compared to what you
> want to do: I have a set of persons that are cases, and I must find
> another set of persons with similar characteristics (matching on
> gender and various dates) that can serve as controls (normally chosen
> randomly among those that fulfil my criteria, but if you want all that
> match, that just makes the procedure simpler). Admittedly, the main
> tables I normally use for such matching only contain 100000-300000
> records.
>
> Whenever I do such matching, I just use the tables as they are (with
> decent indexing, of course) and only extract the records that I'm
> actually interested in. Partly due to the 'choosing randomly' part,
> and partly to avoid slowing down the system too much for other users,
> I go through each person separately, and each person may in some
> cases take a few seconds. I would avoid a 'prematched' table
> containing lots of combinations if possible. I'd say similarity is
> something best handled through the use of indexes... I simply doubt
> that anyone would ever want to know that 'these two brushes have a
> similarity of 75' when there are several hundred (thousand) brushes
> that are a lot more alike. And if one of the brushes changed in even a
> very small way, you might have to update thousands of records.
>
> Hence, I'd recommend you do such matching only when required. The
> actual matching will take some time; depending on your situation it
> could be a few milliseconds or several hours. The huge table you
> envision is something I would only consider after having found (and
> verified that it wasn't due to a stupid mistake of mine) that matching
> upon request actually was too slow.
>
> You say that you have some criteria that always have to match.
> Creating an index that contains these fields may prove very useful
> (depending on the selectivity of the fields; an evenly distributed
> 'Boolean' field isn't a good candidate for any index).
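Selectivity here means how well a value narrows the search: the number of distinct values divided by the total number of rows. A small sketch (Python, just to make the idea concrete; `selectivity` is an illustrative helper, not a Firebird function):

```python
def selectivity(values):
    """Rough index selectivity estimate: fraction of distinct values.

    Near 1.0 means highly selective (a good index candidate);
    near 0.0 means each lookup still matches a large slice of the table.
    """
    return len(set(values)) / len(values)

# A near-unique key is maximally selective:
unique_ids = list(range(1000))          # selectivity 1.0
# An evenly distributed boolean narrows the table to half at best:
flags = [i % 2 for i in range(1000)]    # selectivity 0.002
```

This is why an evenly distributed 'Boolean' field makes a poor index: fetching half the table through an index is usually slower than a natural scan.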
>
> HTH,
> Set
>
> -----Original Message-----
> From: [mailto:firebird-support@yahoogroup] On Behalf Of Luis Carlos Junges
> Sent: 12 June 2008 02:00
> To:
> Subject: [firebird-support] Database design comments
>
> Hi,
>
> I would like to receive some comments about the database design for
> this project. Here is what I have:
>
> I am building a system that will be able to compare electrical brushes
> (graphite electrical brushes). Basically, the comparison will give the
> level of similarity and also highlight the differences between two
> brushes.
>
> The main problem is the quantity of brushes; I am talking about 500000
> brushes. So I have to make 125 billion comparisons (the number of
> combinations, C(500000, 2)) in order to know the level of similarity
> among the brushes. Doing those calculations in real time is impossible
> because it is far too time-consuming. My idea, in this case, is to
> make those comparisons beforehand and store them in a Firebird
> database. Well, now the problems begin:
>
> * I will store something like 100 billion entries (some comparisons do
> not need to be stored; similarity equal to zero, for example).
> * I have to store a reference to the two brushes and also the level of
> similarity. I created a table like this:
>
> | brush1 (integer) | brush2 (integer) | similarity (smallint) |
> |------------------|------------------|-----------------------|
> | 500000           | 34999            | 98                    |
> | 365              | 34999            | 99                    |
> | 1000             | 349              | 56                    |
> | 1000             | 365              | 75                    |
>
> So the size in bytes of each tuple will be, according to the
> description on the Firebird website, 10 bytes. In fact, I need only 7
> bytes, but Firebird has no 1-byte integer type (for similarity) and no
> 3-byte integer type (for brush1 and brush2).
>
> So, taking into account the size of each tuple and the number of
> entries, the database size will be about 931 GiB (almost 1 TiB).
> Correct me if I am wrong, but I think Firebird creates hash tables or
> something similar to optimize searching inside the database, so the
> database size will be even higher than that. I will probably beat the
> current 980 GB database mentioned on the Firebird website.
>
> I created a Firebird database with only this table inside and 1
> million entries; the calculated size is 9.54 MiB, but the real size is
> 70 MiB, about 7.3 times larger. Is this overhead linear? What would
> the size be for 100 billion entries? Extrapolating linearly from the
> measurement, the real database size would be several TiB.
>
> * Windows does not accept file sizes higher than 32 GB, while Linux
> systems accept over 64 TB. So which is better: multiple database files
> on Windows, or a single file on a Unix-like system? I will choose
> whichever is faster at retrieving data with SELECT. Keep in mind that
> if I use multiple files, I will have to search across multiple files
> too. I do not know whether Firebird has a tool to manage multiple
> files and keep them synchronized, nor do I have any idea how fast that
> is.
>
> * Another concern I have is how many threads to use when creating the
> database in the first place. I have no idea how to calculate the ideal
> number of threads that can run together while the database is being
> created. With a single thread it takes too much time; with very many
> threads (5000, for example), the Firebird engine will probably not be
> able to process them, because a lot of collisions will happen.
>
> ---
> Luís Carlos Dill Junges ©
>
> "The reality of each place and each era is a collective
> hallucination." (Bloom, Howard)