Subject Re: [firebird-support] Design Question: Storing and using generic text "tags"
Author Daniel Albuschat
2009/6/26 Helen Borrie <helebor@...>:
>
> At 02:47 PM 26/06/2009, you wrote:
>>I'll be adding a tags system to an existing application so that users can
>>enter one or more tags to help search and categorize various items, similar
>>to what photo management software or blogs do.

[big snip]

>>3) Have a separate TAG table with a many-to-many mapping table to reference
>>ITEMs.
>> Pull apart tags on entry and store new tags in separate TAG rows, create
>>mapping row for each tag.
>> Use exact matching on TAG rows for searching, join to ITEM table.
>> For the report simply join ITEM to TAG through mapping, order by TAG.
>
> What's complex about it? For keyword searching I use two tables: one to hold
> the keywords (which I pump in, all upper case, from a linked list in the
> application, composed from user selections, user input or some routine that
> dumbly goes through and picks out all words except those in an exception
> list. The other table holds an intersection for keywords and documents, with
> FKs to the two tables and a unique constraint on the keyword and the
> document ID. (Actually, I'm working on just such a project right now.)
>
> I have an executable SP to which I send a keyword and a document id as
> arguments. It simply inserts a record in the intersection table. Done. It
> doesn't have to be a SP though: it's a drop-dead-simple parameterised DSQL
> statement.

I absolutely second that. I'd imagine a design like this:

<sql>
/* Whatever data you want to hold goes here */
create table data
(
id integer not null,
somedata blob,
/* here goes your data and additional fields you may need */
constraint pk_data primary key(id)
)^
create generator data^

/* tags are stored in a separte table, while the name is unique and
always stored in uppercase */
create table tags
(
id integer not null,
name varchar(127),
/* the size of name is actually not important, because the tags
table will not contain much data.
but because it's indexed in a unique index, the field's size may
not exceed the index-size-limit
(which is proportional to the page size of this specific database */
constraint pk_tags primary key(id),
constraint un_tags_name unique(name)
)^
create generator tags^

/* this referential table creates the links between a data row and
an arbitrary number of tags */
create table data_tags
(
id integer not null,
id_data integer not null,
id_tag integer not null,
constraint pk_data_tags primary key(id),
constraint fk_data_tags_data foreign key(id_data) references
data(id) on update cascade on delete cascade,
constraint fk_data_tags_tag foreign key(id_tag) references tags(id)
on update cascade on delete cascade
)^
create generator data_tags^

/* use this stored procedure to find a tag by name, is automatically
case-insensitive */
create procedure find_tag(name varchar(255)) returns (id_tag integer) as
begin
select id from tags where name=upper(:name) into :id_tag;
if (:id_tag is not null) then
begin
id_tag = gen_id(tags, 1);
insert into tags(id, name) values (:id_tag, upper(:name));
end
end^

/* searches for one specific tag */
create procedure find_data_by_tag(tag varchar(255)) returns (id_data integer) as
declare variable id_tag integer;
begin
select id from tags where name=upper(:tag) into :id_tag;
if (:id_tag is not null) then
for select id_data from data_tags where id_tag=:id_tag into :id_data do
begin
suspend;
end
end^
</sql>

Parsing a comma-separated list in SQL is kinda tricky, so I would do
this in the program. Once
you have a list of tags, do this (pseudo-code, psql-like):

<code>
declare tags as list; // already filled with tags from the comma separated list
declare ids as list; // will be filled with the tag's ids

for tag in tags do
begin
id = execute procedure find_tag(tag)
add id to ids
end

execute statement
"select distinct(d.*) from data_tags r right join data d on
d.id=r.id_data where r.id_tag in (" + make_separated_list(ids) + ")";
</code>

(the "in"-syntax is limited to 1500 entries, so a maximum of 1500 tags
per select is possible)

I've used distinct because the select statement will return the same
data row multiple times when more than one tag from the list is
assigned to that data.
select d.* from data d where exists(select * from data_tags r where
r.id_data=d.id and r.id_tag in (foo))
will be significantly slower.

Because I'm doing something a bit similar right now, I used this post
to have my design peer-revied at the same time. ;-)

Regards,

Daniel Albuschat
--
eat(this); // delicious suicide