Subject: performance problem after "large" deletes (bug?)
Author: Alexander Marx
hi all,

we are experiencing some serious performance issues here with one of
our db applications ...

the problem initially showed up on an fb 1.0 database running on linux 2.4;
however, further tests showed that it is also reproducible on fb 1.5.1
(linux 2.6, windows xp) and 1.5.2.4634 (linux 2.6, fbsd 5.3) ..

the situation is as follows:

we have a rather small database with 119k pages (~440MB);
freshly restored from backup ..


SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF <-- just for my tests
Transaction - oldest = 113
Transaction - oldest active = 114
Transaction - oldest snapshot = 114
Transaction - Next = 117
Default Character set: NONE


doing a simple select count(*) statement usually takes
some 10 seconds on this database (which is expected, since
it has to do a sequential table scan)


SQL> select count(*) from accountrecord;

PLAN (ACCOUNTRECORD NATURAL)

COUNT
============

2308586

Current memory = 617584 (617568)
Delta memory = 196 (180)
Max memory = 727492 (727492)
Elapsed time= 7.48 sec (7.45 sec)
Cpu = 5.07 sec (5.23 sec)
Buffers = 75 (75)
Reads = 72792 (72792)
Writes = 0
Fetches = 4762679 (4762679)

* values in () from second run


however, if we delete a large number of records (~2/3 of the table) and
commit, then the next client that connects to the database and runs the
previous select count(*) statement gets no result (even after several hours,
at which point i ctrl-c'ed it), and fb constantly uses 100% cpu during that
time ...
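
schematically, the repro sequence looks like this (the actual delete
statement is the one shown in full further below):

SQL> delete from accountrecord where accr_val_year >= 2003;
SQL> commit;
SQL> quit;

bash-3.00# isql /tmp/testdb.gdb              <-- new client / fresh connection
SQL> select count(*) from accountrecord;     <-- never returns, 100% cpu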

tracing the process shows that it mostly does

read(0x4,0x82c7000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c5000,SEEK_SET) = 476860416 (0x1c6c5000)
read(0x4,0x82c6000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c6000,SEEK_SET) = 476864512 (0x1c6c6000)
read(0x4,0x82c5000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c7000,SEEK_SET) = 476868608 (0x1c6c7000)
read(0x4,0x82c4000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c8000,SEEK_SET) = 476872704 (0x1c6c8000)
read(0x4,0x82c3000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c9000,SEEK_SET) = 476876800 (0x1c6c9000)
read(0x4,0x82c2000,0x1000)

but physical disk i/o is clearly not the bottleneck, since vmstat
shows more or less zero physical disk activity ..


 procs    memory             page        disk     faults        Cpu
 r b w     avm    fre  flt re pi po fr sr ad0   in    sy   cs us sy id
 3 1 0  308564  42904   24  0  0  0 24  0   0  458 76309 1228 42 58  0
 6 1 0  308564  42904    1  0  0  0  0  0  13  426 77517 1032 35 65  0
 4 1 0  308564  42904   24  0  0  0 24  0   0  409 77459  942 32 68  0
 5 1 0  308564  42904    0  0  0  0  0  0   0  405 77817  919 33 67  0
 3 1 0  308564  42904   24  0  0  0 24  0   0  392 77466  891 26 74  0
 7 1 0  308564  42904    0  0  0  0  0  0   0  448 77276 1096 41 59  0
 4 1 0  308564  42888   24  0  0  0 24  0   0  451 77003 1160 36 64  0
 5 1 0  308564  42888    0  0  0  0  0  0   3  409 77546  933 38 62  0
 3 1 0  308564  42888   24  0  0  0 24  0   0  404 77355  938 38  6


it's completely cpu-bound (no wonder; 77000 syscalls/s surely has its price)
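
(if roughly half of those syscalls are the 4k read()s from the trace above,
that is in the ballpark of 38000 reads/s * 4 KB ~ 150 MB/s, all of it served
straight out of the OS cache; which fits the zero physical disk activity)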


just for the sake of it, i then tried (on a freshly restored backup) to delete
only some 20000 records from that table, and this time i do get a result ..
but it takes a whopping 880 seconds ... see:

SQL> delete from accountrecord where accr_id >= 580000 and accr_id <= 600000;

PLAN (ACCOUNTRECORD NATURAL)
Current memory = 688420
Delta memory = 68940
Max memory = 727492
Elapsed time= 8.59 sec
Cpu = 6.00 sec
Buffers = 75
Reads = 72822
Writes = 316
Fetches = 4792780


SQL> commit;
Current memory = 649112
Delta memory = -39308
Max memory = 727492
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 2
Fetches = 2


new client connects .... and runs the select count(*)


bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> select count(*) from accountrecord;

PLAN (ACCOUNTRECORD NATURAL)

COUNT
============

2278585

Current memory = 3984248
Delta memory = 8824
Max memory = 4038016
Elapsed time= 880.21 sec
Cpu = 260.32 sec
Buffers = 823
Reads = 32335520
Writes = 122096
Fetches = 37167441


whoa?! look at the stats! .. why is it doing 32 million
reads on a database that has just a bit over 100000 pages??
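
back of the envelope: 32335520 page reads * 4 KB is roughly 130 GB of logical
i/o, i.e. the engine walks over the equivalent of this 440 MB database about
270 times just to count the remaining rows ...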


for completeness, i then tried the original large delete
over the weekend ... and the results are .. well ..
let's just say "they are not impressive" ;-)


SQL> delete from accountrecord where accr_val_year >= 2003;

PLAN (ACCOUNTRECORD NATURAL)
Current memory = 2743632
Delta memory = 2133044
Max memory = 2743632
Elapsed time= 16.47 sec
Cpu = 12.82 sec
Buffers = 75
Reads = 72822
Writes = 54811
Fetches = 9932188

SQL> commit;
Current memory = 649112
Delta memory = -2094520
Max memory = 2749516
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 74
Fetches = 2

SQL> quit;
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF
Transaction - oldest = 113
Transaction - oldest active = 114

bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> set stats;
SQL> set plan;
SQL> select count(*) from accountrecord;

PLAN (ACCOUNTRECORD NATURAL)

COUNT
============

585449

Current memory = 3916100
Delta memory = 3447500
Max memory = 3917268
Elapsed time= 51155.54 sec
Cpu = 16832.51 sec
Buffers = 819
Reads = 2139707744
Writes = 5929156
Fetches = -2128407134 <<<---- wtf?
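
(back of the envelope again: 2139707744 page reads * 4 KB comes to somewhere
around 8-9 TB of logical i/o on a 440 MB file, i.e. roughly 18000 passes over
the whole database; and the negative Fetches value looks like a signed 32-bit
counter that simply wrapped around; if so, the real figure would be about
2^32 - 2128407134 ~ 2.17 billion fetches)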


imho .. something is pretty wrong here; i can understand that the
database does some sort of garbage collection after large deletes,
but 50000 seconds (with 100% cpu the whole time)?!?!

i mean ... within that time frame i can print that database
to paper and do the select() by hand ... :-)

Any ideas/suggestions greatly appreciated ...

if you need more info, please tell me ... i'll try to provide whatever is
necessary .. we need to get this fixed, otherwise i fear we'll have to jump
ship, since waiting 14 hours for a simple select every once in a while is
clearly unacceptable ...


regards,
alex.