Subject Re: [Firebird-Architect] Re: Full Text Search
Author Lester Caine
Jim Starkey wrote:

>>> 9. Search indexing should be html-aware
>>Not only HTML-aware, but also XML, RTF, MS Word, etc. But that is easy
>>to achieve. If that content is stored in a BLOB, we already have the
>>concept of a BLOB filter. Just define a "searchable" BLOB type and
>>corresponding "HTML", "PDF", "RTF" BLOB types. The conversion between
>>those datatypes is done by a filter.
> Oh, my head swims. Netfrastructure has filters for <*ml>, MSWord, and
> PDF. MSWord is a task on the scope of the Vulcan project. PDF has
> better documentation but more intractable problems (hint: you have to
> emulate everything in a laser printer but the paper path and toner
> drum). Nobody's asked for RTF, thank god, but I have a converter on the
> shelf somewhere when they do.

I tend to convert PDFs to provide a 'plain text' version, and I
handle RTF in a similar way. It just needs the right filter.

> Interesting that you missed the Open Office formats? Hey, Roman, get
> with the program. And what about WordPerfect? And the worst of them
> all, PowerPoint?

They are just filters, and I'll ask again: could they be handled as
BLOB filters, or do we need something else? The first step is obviously
taking the data and identifying indexable 'character entities'. Coming
up with at least an outline of where we are heading will mean I can
start switching the hard-coded stuff in line with it ;)
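
To make the question concrete, here is a rough sketch of the two steps
as I picture them: a per-format filter that reduces the stored content
to plain text, followed by a pass that picks out the indexable terms.
It is plain Python, not the Firebird BLOB filter API, and every name in
it (FILTERS, html_to_text, plain_to_text, index_terms) is made up for
illustration. Only HTML and plain text are covered; a PDF or RTF filter
would be one more entry in the table.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(blob: bytes) -> str:
    """Hypothetical 'HTML' -> 'searchable' filter: strip the markup."""
    parser = _TextExtractor()
    parser.feed(blob.decode("utf-8", errors="replace"))
    parser.close()
    return " ".join(parser.parts)


def plain_to_text(blob: bytes) -> str:
    """Hypothetical pass-through filter for content that is already text."""
    return blob.decode("utf-8", errors="replace")


# Hypothetical registry: source format -> filter producing searchable text.
FILTERS = {"html": html_to_text, "text": plain_to_text}


def index_terms(blob: bytes, fmt: str) -> list:
    """Run the filter for fmt, then split the text into lowercase terms."""
    try:
        to_text = FILTERS[fmt]
    except KeyError:
        raise ValueError(f"no filter registered for format {fmt!r}")
    return [term.lower() for term in re.findall(r"\w+", to_text(blob))]
```

The point of the split is that the indexer only ever sees the output of
a filter, so adding a format should not touch the indexing code at all.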

Lester Caine
L.S.Caine Electronic Services