Subject | Hypothetical near-match search |
---|---|
Author | Nigel Weeks |
Post date | 2005-04-27T00:59:32Z |
Please excuse my ramblings, this is just an idea, and I'm wondering if
people are doing something similar
I'd like an indexing system equipped with Soundex and Metaphone codes, for
fuzzy matching of words. I think I can do it.
Here's how:
/* A table containing every unique word in your database */
CREATE TABLE srch$word (
srch$id INT64 NOT NULL, /* A unique ID for each word */
srch$word VARCHAR(200) NOT NULL, /* The word, in uppercase */
srch$sdx VARCHAR(10), /* The Soundex Representation of the word */
srch$meta VARCHAR(50), /* The MetaPhone Representation of the word */
PRIMARY KEY(srch$id),
UNIQUE(srch$word)
);
CREATE GENERATOR gen_srch$word.srch$id;
/* A table referencing every table, field, and row, and the word contained
therein */
CREATE TABLE srch$locations (
rdb$relation_name CHAR(31) NOT NULL, /* Which table is it found in? */
rdb$field_name CHAR(31) NOT NULL, /* Which field is it found in? */
rdb$db_key CHAR(8), /* Which row is it found in? */
srch$id INT64, /* Link to the word */
PRIMARY KEY(rdb$relation_name,rdb$field_name,rdb$DB_KEY,srch$id)
/* A foreign key here might just slow things down. */
);
A small spider could then run over the records, populating the index with
rows.
A possible search query could then be:
SELECT rdb$relation_name, rdb$field_name, rdb_db_key FROM srch$locations LOC
JOIN srch$word WORD
ON LOC.srch$id = WORD.srch$id
WHERE WORD.srch$word = 'Smith' OR WORD.srch$sdx = 'S520'
'S530' is the soundex for the words: Saint, Sand, Sandy, Santee, Santi,
Schmid, Schmidt, Schmit, Schmitt, Shand, Shumate, Sinnott, Smith, Smithey,
Smoot, Smooty, Smyth, Smythe, etc, etc.
If I could then work out a clever join to the live tables to actually
retrieve the rows, I'd be cooking!
My question is, can anyone see any shortcomings in this approach?
Nige.
Nigel Weeks
Tech Support & Systems Developer
nweeks@...
www.examiner.com.au
71 - 75 Paterson Street
Launceston
Tasmania 7250
Australia
Phone: 03 6336 7234
[Non-text portions of this message have been removed]
people are doing something similar
I'd like an indexing system equipped with Soundex and Metaphone codes, for
fuzzy matching of words. I think I can do it.
Here's how:
/* A table containing every unique word in your database */
CREATE TABLE srch$word (
srch$id INT64 NOT NULL, /* A unique ID for each word */
srch$word VARCHAR(200) NOT NULL, /* The word, in uppercase */
srch$sdx VARCHAR(10), /* The Soundex Representation of the word */
srch$meta VARCHAR(50), /* The MetaPhone Representation of the word */
PRIMARY KEY(srch$id),
UNIQUE(srch$word)
);
CREATE GENERATOR gen_srch$word.srch$id;
/* A table referencing every table, field, and row, and the word contained
therein */
CREATE TABLE srch$locations (
rdb$relation_name CHAR(31) NOT NULL, /* Which table is it found in? */
rdb$field_name CHAR(31) NOT NULL, /* Which field is it found in? */
rdb$db_key CHAR(8), /* Which row is it found in? */
srch$id INT64, /* Link to the word */
PRIMARY KEY(rdb$relation_name,rdb$field_name,rdb$DB_KEY,srch$id)
/* A foreign key here might just slow things down. */
);
A small spider could then run over the records, populating the index with
rows.
A possible search query could then be:
SELECT rdb$relation_name, rdb$field_name, rdb_db_key FROM srch$locations LOC
JOIN srch$word WORD
ON LOC.srch$id = WORD.srch$id
WHERE WORD.srch$word = 'Smith' OR WORD.srch$sdx = 'S520'
'S530' is the soundex for the words: Saint, Sand, Sandy, Santee, Santi,
Schmid, Schmidt, Schmit, Schmitt, Shand, Shumate, Sinnott, Smith, Smithey,
Smoot, Smooty, Smyth, Smythe, etc, etc.
If I could then work out a clever join to the live tables to actually
retrieve the rows, I'd be cooking!
My question is, can anyone see any shortcomings in this approach?
Nige.
Nigel Weeks
Tech Support & Systems Developer
nweeks@...
www.examiner.com.au
71 - 75 Paterson Street
Launceston
Tasmania 7250
Australia
Phone: 03 6336 7234
[Non-text portions of this message have been removed]