Subject | Re: [Firebird-Architect] Re: Google-like scoring in databases |
---|---|
Author | Jim Starkey |
Post date | 2003-07-01T15:39:43Z |
Roman Rokytskyy wrote:
>>The PageRank of a page A is given as follows:
>>
>>PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
>
>And that was exactly my question. In databases we do not have this
>information stored somewhere explicitly (except, probably, foreign key
>relationship).
>
>So I question, if PageRank is crucial for text search in relational
>databases. If it is, how are we going to calculate our RelationRank?
>
>If we say that PageRank is crucial and we do not find a way to model
>it with relations (our RelationRank), then the whole idea of having
>google-like search in Firebird is questionable: server will not
>"understand" content and I suspect that we will not be able to
>manipulate it neither from within stored procedures and UDFs nor from
>regular statements.
>

Google's page ranking strategy must be understood in context. Google
indexes the entire web, and any popular search is going to produce a
vast number of hits with roughly equivalent scores based only on
content. The page ranking is a second score that attempts to weight
the results with regard to what other sites thought useful. If the
page ranking were the primary score, the page with the most links
would always win without regard to context.
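
To make the quoted formula and the "second score" idea concrete, here is a minimal Python sketch. It iterates PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) over a tiny link graph and then uses the result only to break ties between hits whose content scores are roughly equal. The names pagerank and rank_hits, the damping factor 0.85, the fixed iteration count, and the one-decimal bucketing of content scores are all assumptions made for illustration; nothing in this thread says how Google actually combines the two scores.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical data).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T contributes PR(T) divided by its outbound link count C(T).
            inbound = sum(pr[src] / len(targets)
                          for src, targets in links.items() if page in targets)
            new_pr[page] = (1 - d) + d * inbound
        pr = new_pr
    return pr


def rank_hits(hits, pr):
    """Order hits by content score first; the link score only breaks near-ties.

    hits is a list of (page, content_score) pairs. Rounding to one decimal
    is an assumed way of treating scores as "roughly equivalent".
    """
    ordered = sorted(hits, key=lambda h: (-round(h[1], 1), -pr.get(h[0], 0.0)))
    return [page for page, _ in ordered]


links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
pr = pagerank(links)
hits = [("A", 0.52), ("B", 0.49), ("C", 0.51), ("D", 0.9)]
print(rank_hits(hits, pr))
```

In this sketch page D has no inbound links at all, yet it still comes first because its content score is clearly higher; the link score only reorders A, B and C, whose content scores are nearly identical.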

A database/site search is a slightly different case. The context, the
site, is obviously known, and the user is generally looking for
something that he or she has a reasonable expectation of finding. A
massive number of duplicates is a difficult problem -- search
refinement is almost always the right solution. Searching the
IBPhoenix site for "blob" or "blobs" is going to get a large number of
duplicates that are hard to differentiate; "blob seek" or "blob
subtypes" or "blob filters" will probably do the trick. But even in
the case of "blobs", a scoring scheme based on the references weighted
by the inverse of the word number will do a very good job of finding
general discussion articles. In none of these cases would a measure of
primary/foreign key relationships or external links be of any use. In
fact, within the knowledge base, there aren't any relationships of the
article table to anything else, and if there were, they would probably
have nothing to do with the content of the article.
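
A rough Python sketch of the kind of scoring scheme mentioned above follows. It rests on one loudly stated assumption: that "word number" means the position of the word within the text (first word = 1), so each reference to the search term contributes 1/position and early, repeated mentions score highest. That reading, the function name score, and the simple tokenizer are all illustrative guesses, not a description of any particular product's implementation.

```python
import re


def score(text, term):
    """Sum 1/word_number over every occurrence of term in text.

    ASSUMPTION: "word number" is the 1-based position of the word in the
    text. Splitting on runs of letters and digits is also an assumption.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    wanted = term.lower()
    return sum(1.0 / number
               for number, word in enumerate(words, start=1)
               if word == wanted)


general = "Blobs in general: blobs are large objects"
passing = "To seek within a stream you may also use blobs"
print(score(general, "blobs"), score(passing, "blobs"))
```

Under that assumption, a text that opens with the term and repeats it outranks one that mentions it once in passing, which is the behavior you would want for surfacing general discussion articles.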

I think there is a great deal to learn from Google and other search
tools with regard to lexical transformations and content ranking, but
I also think that extrapolating external link scoring to key
relationships doesn't make sense. It may be useful to use word hits on
searchable fields in records related by primary/foreign key to
generate result-list elements that are joins, but frankly, after three
years of looking, I haven't found an application where this would be
useful.

Netfrastructure has a new product called NetfraSite designed as a
website-in-a-can for multi-tier organizations. The primary
organization is around a hierarchy of groups, with site members
belonging to zero or more groups. Content types include articles,
scheduled events, procedures (like simple articles that don't expire),
and threaded discussion. Following is a list of fields marked as
searchable; the rest of the schema is pretty much intuitive. In most
cases content items have foreign keys to an author and the group that
owns the content. Please correct me if I'm missing something, but I
don't see any particularly interesting ways to exploit key
relationships. (Please excuse the HTML mail.)

TABLENAME       FIELD
ARTICLES        HEADER
ARTICLES        TAGLINE
ARTICLES        TEXT
DISCUSSIONS     DESCRIPTION
EVENTS          AGENDA
EVENTS          KEYWORD
EVENTS          MINUTES
EVENTS          WHAT
GROUPS          ABBREVIATION
GROUPS          CHARTER
GROUPS          DESCRIPTION
GROUPS          GROUP
MESSAGES        SUBJECT
MESSAGES        TEXT
PAGES           TEXT
PAGES           TITLE
PEOPLE          DESCRIPTION
PEOPLE          FULL_NAME
PEOPLE          SCREEN_NAME
PROCEDURES      ABBREVIATION
PROCEDURES      PROCEDURE
PROCEDURES      TEXT
PUBLIC_EVENTS   AGENDA
PUBLIC_EVENTS   KEYWORD
PUBLIC_EVENTS   MINUTES
PUBLIC_EVENTS   WHAT
SECTIONS        TEXT
SECTIONS        TITLE
TEMP            TEXT
TEMP            TITLE
THREADS         DESCRIPTION
THREADS         NAME
[Non-text portions of this message have been removed]