Subject: performance problem after "large" deletes (bug?)
Author: Alexander Marx
Post date: 2004-11-22T10:37:24Z
hi all,
we are experiencing some serious performance issues with one of
our db applications ...
the problem initially showed up on an fb 1.0 database running on linux 2.4;
further tests showed that it is also reproducible on fb 1.5.1
(linux 2.6, windows xp) and on 1.5.2.4634 (linux 2.6, fbsd 5.3) ..
situation is as follows:
we have a rather small database with 119k pages (~440MB);
freshly restored from backup ..
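(in case it matters: the restore was a plain gbak run, roughly like the
following; the backup file name here is made up for the example:
bash-3.00# gbak -c -v /tmp/testdb.fbk /tmp/testdb.gdb -user SYSDBA -password masterkey
no special switches, page size left at the default 4096)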
SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF <-- just for my tests
Transaction - oldest = 113
Transaction - oldest active = 114
Transaction - oldest snapshot = 114
Transaction - Next = 117
Default Character set: NONE
doing a simple select count(*) statement usually takes
some 10 seconds on this database (which is expected, since
it has to do a sequential table scan):
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
2308586
Current memory = 617584 (617568)
Delta memory = 196 (180)
Max memory = 727492 (727492)
Elapsed time= 7.48 sec (7.45 sec)
Cpu = 5.07 sec (5.23 sec)
Buffers = 75 (75)
Reads = 72792 (72792)
Writes = 0
Fetches = 4762679 (4762679)
* values in () from second run
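(these baseline numbers look plausible to me: ~73k page reads against a
~119k page file, and 4.76 million fetches works out to roughly two
fetches per record)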
however, if we delete a large number (~2/3) of the records from that
table and commit, then the next client that connects to the
database and runs the same select count(*) statement gets
no result at all (even after several hours, at which point i ctrl-c'ed it),
while fb constantly burns 100% cpu ...
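(to be explicit, the whole sequence is just this; the delete statement
is shown in full further down:
SQL> delete from accountrecord where accr_val_year >= 2003;
SQL> commit;
-- disconnect, then connect again with a fresh isql --
SQL> select count(*) from accountrecord; <-- this one never returns
)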
tracing the server process shows that it mostly does this:
read(0x4,0x82c7000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c5000,SEEK_SET) = 476860416 (0x1c6c5000)
read(0x4,0x82c6000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c6000,SEEK_SET) = 476864512 (0x1c6c6000)
read(0x4,0x82c5000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c7000,SEEK_SET) = 476868608 (0x1c6c7000)
read(0x4,0x82c4000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c8000,SEEK_SET) = 476872704 (0x1c6c8000)
read(0x4,0x82c3000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c9000,SEEK_SET) = 476876800 (0x1c6c9000)
read(0x4,0x82c2000,0x1000)
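(that is truss -p <server pid> output from the fbsd box, if i remember
the run correctly: just endless lseek/read pairs pulling in one 4k page
after the next)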
but physical disk i/o is clearly not the limit: vmstat shows more or
less zero physical disk activity, so everything must be coming out of
the os cache ..
procs memory page disk faults Cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
3 1 0 308564 42904 24 0 0 0 24 0 0 458 76309 1228 42 58 0
6 1 0 308564 42904 1 0 0 0 0 0 13 426 77517 1032 35 65 0
4 1 0 308564 42904 24 0 0 0 24 0 0 409 77459 942 32 68 0
5 1 0 308564 42904 0 0 0 0 0 0 0 405 77817 919 33 67 0
3 1 0 308564 42904 24 0 0 0 24 0 0 392 77466 891 26 74 0
7 1 0 308564 42904 0 0 0 0 0 0 0 448 77276 1096 41 59 0
4 1 0 308564 42888 24 0 0 0 24 0 0 451 77003 1160 36 64 0
5 1 0 308564 42888 0 0 0 0 0 0 3 409 77546 933 38 62 0
3 1 0 308564 42888 24 0 0 0 24 0 0 404 77355 938 38 6
it's completely cpu-bound (no wonder: 77000 syscalls/s surely come at a price)
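(quick back-of-the-envelope: ~77000 syscalls/s means ~38500 lseek/read
pairs, i.e. ~38500 4k pages per second, or roughly 150 MB/s out of the
os cache; at that rate the whole 440MB file gets cycled through about
every 3 seconds)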
just for the sake of it, i then tried (on a freshly restored backup) to delete
only some 20000 records from that table, and this time i do get a result ..
but it takes a whopping 880 seconds ... see:
SQL> delete from accountrecord where accr_id >= 580000 and accr_id <= 600000;
PLAN (ACCOUNTRECORD NATURAL)
Current memory = 688420
Delta memory = 68940
Max memory = 727492
Elapsed time= 8.59 sec
Cpu = 6.00 sec
Buffers = 75
Reads = 72822
Writes = 316
Fetches = 4792780
SQL> commit;
Current memory = 649112
Delta memory = -39308
Max memory = 727492
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 2
Fetches = 2
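(note that the delete itself is cheap: ~73k reads, essentially the same
single table scan as the plain count(*), and the commit is instant; the
pain only shows up afterwards)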
new client connects .... and runs the select count(*)
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
2278585
Current memory = 3984248
Delta memory = 8824
Max memory = 4038016
Elapsed time= 880.21 sec
Cpu = 260.32 sec
Buffers = 823
Reads = 32335520
Writes = 122096
Fetches = 37167441
whoa?! look at the stats! .. why is it doing 32 million
reads on a database that has just a bit over 100000 pages??
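(doing the math: the count dropped from 2308586 to 2278585, so 30001
rows were actually deleted, and 32335520 reads / 30001 rows is roughly
1080 page reads per deleted record; put differently, the ~119k page
file was read end to end about 270 times)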
for completeness, i then tried the original large delete
over the weekend ... and the results are .. well ..
let's just say "they are not impressive" ;-)
SQL> delete from accountrecord where accr_val_year >= 2003;
PLAN (ACCOUNTRECORD NATURAL)
Current memory = 2743632
Delta memory = 2133044
Max memory = 2743632
Elapsed time= 16.47 sec
Cpu = 12.82 sec
Buffers = 75
Reads = 72822
Writes = 54811
Fetches = 9932188
SQL> commit;
Current memory = 649112
Delta memory = -2094520
Max memory = 2749516
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 74
Fetches = 2
SQL> quit;
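(so again: deleting ~1.7 million records, 2308586 minus the 585449 that
remain below, took all of 16.5 seconds plus an instant commit; the
delete itself is not the problem)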
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF
Transaction - oldest = 113
Transaction - oldest active = 114
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> set stats;
SQL> set plan;
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
585449
Current memory = 3916100
Delta memory = 3447500
Max memory = 3917268
Elapsed time= 51155.54 sec
Cpu = 16832.51 sec
Buffers = 819
Reads = 2139707744
Writes = 5929156
Fetches = -2128407134 <<<---- wtf?
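(that negative number looks like a signed 32-bit counter wrapping
around: -2128407134 + 2^32 = 2166560162, so presumably ~2.17 billion
fetches; the reads counter, 2139707744, is also sitting just below 2^31)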
imho, something is pretty wrong here; i can understand that the
database has to do some sort of garbage collection after large deletes,
but 51000+ seconds (that's over 14 hours) at 100% cpu the whole time?!?!
i mean ... within that time frame i could print that database
out on paper and do the select by hand ... :-)
any ideas/suggestions greatly appreciated ...
if you need more info please tell me, i'll try to provide whatever is
necessary .. we need to get this fixed, otherwise i fear we'll have to jump
ship, since waiting 14 hours for a simple select every once in a while is
clearly unacceptable ...
regards,
alex.