firebird-support - Re: Blob vs Varchar

Subject	Re: Blob vs Varchar - Performance Benchmarks
Author	Adam
Post date	2006-10-29T00:48:21Z

--- In firebird-support@yahoogroups.com, Rajesh Punjabi <rajesh@...>
wrote:

>
>
> Hi All,
>
> I have a roughly 3.5 GB FB 1.5 production database running on a

Linux

> FC4 server with an AMD 1.8GHz processor and 3GB RAM. This database

> used to store resumes as we are a recruitment company.
>
> Over the last 1 year the database is slowing down considerably.

Users

> have increased from 4-5 to about 20 concurrent users in this

period. The

> primary slowdown occurs when users do a brute force search for

keywords

> in the body of the resume (stored in the blob).

From the start, you can see such a design has no scalability beyond
throwing hardware at it. You need to think to yourself, 'if I had to
do this manually, what would I do?'. Would you when looking for some
resumes take the first resume, scan through it for your keyword, then
take the second resume, scan through it, etc? Of course not, it is
just not practical. The more people you have scanning through the
resumes, the more work needs to be done. The more resumes you have,
the more work needs to be done.

A better approach is to attach keywords to each resume so that you do
not need to consider the ones that have absolutely no relevance to
your keywords.

Once you have keywords, the next problem is ranking, deciding which
of the resumes that contain your queries are the most significant.
Then you need to consider exclusion lists, most likely all resumes
will contain the word 'the' or 'and', so if someone uses that in a
search keyword, you may want to exclude considering it. If your
resume contains many of the search terms you may want to increase its
weighting. You may want to consider different spellings of words, or
even use some sort of thesaurus logic (resume or CV)

> All other queries with
> indexes run reasonably fast. (sub 10 secs).
>
> Is there any way to directly optimize blob search that anyone here

can

> point me to ?
>
> I have looked up varchar fields and if I shift the data from blobs

> varchar will performance improve ? Are there any benchmarks for

this ?

Again, this is the wrong approach. It is impossible to index a BLOB,
and containing searches on a varchar field can not be indexed
either. 'Starting with' searches can use an index, but that is
unlikely to be very useful in your case.

Ann has already linked to a couple of FTS implementations that you
should look at. I have attached some proof of concept code I did to
implement an indexed search of a list of titles.

As each title is added, I tokenise it into words. I establish a
relationship between the word and the title and calculate the soundex
of the word and the number of titles each word occurs in. I consider
that the fewer titles a particular word occurs in, the more useful
that word is in identifying potential titles.

This allows me to (in a very simple sense) create a search that
quickly identifies a subset of all titles which match the search
criteria with relatively little expense. I imagine a very similar
concept could be applied to your logic. The tokenisation is probably
more complex so may have to be client driven, but such an approach is
worth investigating.

--- FTS

-- CREDITS:
-- SOUNDEX stored procedure was provided by Si Carter to the
www.fbtalk.net website (http://www.fbtalk.net/viewtopic.php?id=182)

-- SPLITTEXT stored procedure was provided by Si Carter to the
www.fbtalk.net website (http://www.fbtalk.net/viewtopic.php?id=238)

DECLARE EXTERNAL FUNCTION substr
CSTRING(255), SMALLINT, SMALLINT
RETURNS CSTRING(255) FREE_IT
ENTRY_POINT 'IB_UDF_substr' MODULE_NAME 'ib_udf';

DECLARE EXTERNAL FUNCTION strlen
CSTRING(32767)
RETURNS INTEGER BY VALUE
ENTRY_POINT 'IB_UDF_strlen' MODULE_NAME 'ib_udf';

DECLARE EXTERNAL FUNCTION ascii_char
INTEGER
RETURNS CSTRING(1) FREE_IT
ENTRY_POINT 'IB_UDF_ascii_char' MODULE_NAME 'ib_udf';

DECLARE EXTERNAL FUNCTION rtrim
CSTRING(255)
RETURNS CSTRING(255) FREE_IT
ENTRY_POINT 'IB_UDF_rtrim' MODULE_NAME 'ib_udf';

DECLARE EXTERNAL FUNCTION ltrim
CSTRING(255)
RETURNS CSTRING(255) FREE_IT
ENTRY_POINT 'IB_UDF_ltrim' MODULE_NAME 'ib_udf';

SET TERM ^ ;

CREATE OR ALTER PROCEDURE SOUNDEX(ipWORD VARCHAR(50))
RETURNS (opRESULT VARCHAR(5))
AS
DECLARE VARIABLE vCurrChar CHAR(1);
DECLARE VARIABLE vIDX INTEGER;
DECLARE VARIABLE vLen INTEGER;
BEGIN
-- if its a null/blank word then exit returning nothing
IF (ipWORD IS NULL OR ipWORD = '') THEN
EXIT;

-- were starting with the second character
vIDX = 2;

-- make the word uppercase
ipWORD = UPPER(ipWORD);

/* grab the first letter from the word
and set the initial length to 1 */
opRESULT = SUBSTR(ipWORD, 1, 1);
vLen = 1;

WHILE (vIDX < 25) DO
BEGIN
-- get the Nth char from the string
vCurrChar = CAST(SUBSTR(ipWORD, :vIDX, :vIDX) AS CHAR(1));

-- calculate its numeric value
IF (vCurrChar = 'R') THEN
vCurrChar = '6';
ELSE IF (vCurrChar = 'M' OR vCurrChar = 'N') THEN
vCurrChar = '5';
ELSE IF (vCurrChar = 'L') THEN
vCurrChar = '4';
ELSE IF (vCurrChar = 'D' OR vCurrChar = 'T') THEN
vCurrChar = '3';
ELSE IF (vCurrChar = 'B' OR vCurrChar = 'F' OR
vCurrChar = 'P' OR vCurrChar = 'V') THEN
vCurrChar = '1';
ELSE IF (vCurrChar = 'C' OR vCurrChar = 'G' OR
vCurrChar = 'J' OR vCurrChar = 'K' OR vCurrChar = 'Q' OR
vCurrChar = 'S' OR vCurrChar = 'X' OR vCurrChar = 'Z') THEN
vCurrChar = '2';
ELSE vCurrChar = '';

-- if its not a blank string then...
IF (vCurrChar <> '') THEN
BEGIN
-- add it to the current result and increment the length
opRESULT = opRESULT || vCurrChar;
vLen = vLen + 1;

/* the maximum length to return is 5, if
we have currently reached it then return
the result and exit the proc */
IF (vLen = 5) THEN
BEGIN
SUSPEND;
EXIT;
END
END

-- increment the char index
vIDX = vIDX + 1;
END

/* soundex should return 5 chars, if its not that
long already then pad it with zero's */
WHILE (vLen < 5) DO
BEGIN
opRESULT = opRESULT || '0';
vLen = vLen + 1;
END

-- finally return the result
SUSPEND;
END
^

CREATE OR ALTER PROCEDURE SPLITTEXT
(
ipTEXT VARCHAR(32000),
ipSPLITCHARS VARCHAR(20),
ipIGNORECHARS VARCHAR(20)
)
RETURNS
(
opTEXT VARCHAR(500))
AS
DECLARE VARIABLE vCurrChar VARCHAR(1);
DECLARE VARIABLE vIDX INTEGER;
DECLARE VARIABLE vLen INTEGER;
BEGIN
-- if the text is null/blank then exit
vLen = STRLEN(ipTEXT);

IF (ipTEXT IS NULL OR ipTEXT = '') THEN
BEGIN
SUSPEND;
EXIT;
END

-- default to spaces and line sep if no split char defined
IF ((ipSPLITCHARS IS NULL) OR (ipSPLITCHARS = '')) THEN
ipSPLITCHARS = ' ' || ASCII_CHAR(13) || ASCII_CHAR(10);

IF (ipIGNORECHARS IS NULL) THEN
ipIGNORECHARS = '';

/* were starting with the first character and
on the first octet */
vIDX = 1;
opTEXT = '';

-- get the 1st char from the string
vCurrChar = CAST(SUBSTR(ipTEXT, :vIDX, :vIDX) AS VARCHAR(1));

WHILE (vIDX <= vLen) DO
BEGIN
-- is it a seperator char?
IF (ipSPLITCHARS CONTAINING vCURRCHAR) THEN
BEGIN
opTEXT = RTRIM(LTRIM(opTEXT));

-- return the current text, if there is any
IF (opTEXT <> '') THEN
SUSPEND;

-- reset the output buffer
opTEXT = '';
END ELSE
BEGIN
IF ((vCurrChar <> '') AND (ipIGNORECHARS NOT CONTAINING
vCurrChar)) THEN
opTEXT = opTEXT || vCurrChar;
END
-- increment the char index
vIDX = vIDX + 1;

-- get the Nth char from the string
vCurrChar = CAST(SUBSTR(ipTEXT, :vIDX, :vIDX) AS VARCHAR(1));
END

-- if there is any text left then return it
IF (opTEXT <> '') THEN
SUSPEND;

END
^

SET TERM ; ^

CREATE GENERATOR GEN_WORDSID;
CREATE GENERATOR GEN_TITLEID;
CREATE GENERATOR GEN_TITLESWORDSID;

CREATE DOMAIN PK BIGINT NOT NULL;

CREATE TABLE TITLES (
ID PK,
NAME VARCHAR(200),
CONSTRAINT PK_TITLES PRIMARY KEY (ID)
);

CREATE TABLE WORDS
(
ID PK,
NAME VARCHAR(200),
SOUNDEX VARCHAR(10),
TITLECOUNT BIGINT,
CONSTRAINT PK_WORDS PRIMARY KEY (ID),
CONSTRAINT UQ_WORDSNAME UNIQUE (NAME)
);

CREATE INDEX IX_WORDSSOUNDEX ON WORDS(SOUNDEX);

CREATE TABLE TITLESWORDS
(
ID PK,
TITLESID BIGINT,
WORDSID BIGINT,
CONSTRAINT PK_TITLESWORDS PRIMARY KEY (ID),
CONSTRAINT FK_TITLESWORDSTITLESID FOREIGN KEY (TITLESID) REFERENCES
TITLES (ID) ON UPDATE CASCADE ON DELETE CASCADE,
CONSTRAINT FK_TITLESWORDSWORDSID FOREIGN KEY (WORDSID) REFERENCES
WORDS (ID) ON UPDATE CASCADE ON DELETE CASCADE
);

COMMIT;

SET TERM ^ ;

CREATE OR ALTER TRIGGER TITLES_AI FOR TITLES
ACTIVE AFTER INSERT POSITION 0
AS
DECLARE VARIABLE AWORD VARCHAR(100);
DECLARE VARIABLE WORDSID BIGINT;
BEGIN
FOR
SELECT DISTINCT OPTEXT
FROM SPLITTEXT(NEW.NAME, ' ', '-, (, ), ''')
INTO :AWORD
DO
BEGIN

WORDSID = NULL;

SELECT ID
FROM WORDS
WHERE NAME = :AWORD
INTO :WORDSID;

IF (WORDSID IS NULL) THEN
BEGIN
WORDSID = GEN_ID(GEN_WORDSID, 1);
INSERT INTO WORDS(ID, NAME, TITLECOUNT) VALUES
(:WORDSID, :AWORD, 1);
END
ELSE
BEGIN
UPDATE WORDS SET
TITLECOUNT = TITLECOUNT + 1
WHERE ID = :WORDSID;
END

INSERT INTO TITLESWORDS (ID, TITLESID, WORDSID) VALUES (
GEN_ID(GEN_TITLESWORDSID, 1),
NEW.ID,
:WORDSID
);
END
END
^

CREATE OR ALTER TRIGGER WORDS_BI FOR WORDS
ACTIVE BEFORE INSERT POSITION 0
AS
BEGIN
IF (NEW.SOUNDEX IS NULL) THEN
BEGIN
SELECT OPRESULT
FROM SOUNDEX(NEW.NAME)
INTO NEW.SOUNDEX;
END
END
^

CREATE OR ALTER PROCEDURE INTERNALSEARCH
(
SEARCHSTRING VARCHAR(200)
)
RETURNS
(
SOUNDEX VARCHAR(5),
TITLECOUNT BIGINT
)
AS
DECLARE VARIABLE SEARCHWORD VARCHAR(50);
BEGIN
FOR SELECT OPTEXT
FROM SPLITTEXT(:SEARCHSTRING, ' ', '-,
(, ), ''')
INTO :SEARCHWORD
DO
BEGIN
SELECT OPRESULT
FROM SOUNDEX(:SEARCHWORD)
INTO :SOUNDEX;

SELECT SUM(TITLECOUNT)
FROM WORDS
WHERE SOUNDEX = :SOUNDEX
INTO :TITLECOUNT;

SUSPEND;
END
END
^

CREATE OR ALTER PROCEDURE SEARCHFOR
(
SEARCHSTRING VARCHAR(200),
MAXRESULTS BIGINT
)
RETURNS
(
TITLESID BIGINT,
TITLESWORDS VARCHAR(200)
)
AS
DECLARE VARIABLE SEARCHWORD VARCHAR(50);
DECLARE VARIABLE SOUNDEX VARCHAR(5);
DECLARE VARIABLE I BIGINT;
BEGIN
I = 1;
FOR
SELECT SOUNDEX
FROM INTERNALSEARCH(:SEARCHSTRING)
ORDER BY TITLECOUNT
INTO :SOUNDEX
DO
BEGIN
FOR
SELECT T.ID, T.NAME
FROM WORDS W
JOIN TITLESWORDS TW ON (W.ID = TW.WORDSID)
JOIN TITLES T ON (TW.TITLESID = T.ID)
WHERE SOUNDEX = :SOUNDEX
INTO :TITLESID, :TITLESWORDS
DO
BEGIN
SUSPEND;
I = :I + 1;
IF ((:MAXRESULTS > 0) AND
(:I = :MAXRESULTS)) THEN EXIT;
END
END
END
^

SET TERM ; ^

COMMIT;

--- sample data

INSERT INTO TITLES (ID, NAME) VALUES(GEN_ID(GEN_TITLEID,1), 'Hello
Foo');
INSERT INTO TITLES (ID, NAME) VALUES(GEN_ID(GEN_TITLEID,1), 'World
Bar');
INSERT INTO TITLES (ID, NAME) VALUES(GEN_ID(GEN_TITLEID,1), 'Hello
World');
INSERT INTO TITLES (ID, NAME) VALUES(GEN_ID(GEN_TITLEID,1), 'Foo
Bar');

COMMIT

--- Look for the text

SELECT *
FROM SEARCHFOR('Hello World', 10);

COMMIT;

---

Adam