Subject: performance problem after "large" deletes (bug?)
Author: Alexander Marx
Post date: 2004-11-22T10:37:24Z
hi all,
we are experiencing some serious performance issues with one of
our db applications ...
the problem initially showed up on an fb 1.0 database running on linux 2.4;
further tests showed that it is also reproducible on fb 1.5.1
(linux 2.6, windows xp) and on 1.5.2.4634 (linux 2.6, fbsd 5.3) ..
situation is as follows:
we have a rather small database with 119k pages (~440MB);
freshly restored from backup ..
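(in case it matters: the restore was a plain gbak run, roughly like the
following; the backup file name here is made up for the example:
bash-3.00# gbak -c -v /tmp/testdb.fbk /tmp/testdb.gdb -user SYSDBA -password masterkey
no special switches, page size left at the default 4096)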
SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF <-- just for my tests
Transaction - oldest = 113
Transaction - oldest active = 114
Transaction - oldest snapshot = 114
Transaction - Next = 117
Default Character set: NONE
doing a simple select count(*) statement usually takes
some 10 seconds on this database (which is expected, since
it has to do a sequential table scan):
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
2308586
Current memory = 617584 (617568)
Delta memory = 196 (180)
Max memory = 727492 (727492)
Elapsed time= 7.48 sec (7.45 sec)
Cpu = 5.07 sec (5.23 sec)
Buffers = 75 (75)
Reads = 72792 (72792)
Writes = 0
Fetches = 4762679 (4762679)
* values in () from second run
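(these baseline numbers look plausible to me: ~73k page reads against a
~119k page file, and 4.76 million fetches works out to roughly two
fetches per record)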
however, if we delete a large number (~2/3) of the records from that
table and commit, then the next client that connects to the
database and runs the same select count(*) statement gets
no result at all (even after several hours, at which point i ctrl-c'ed it),
while fb constantly burns 100% cpu ...
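(to be explicit, the whole sequence is just this; the delete statement
is shown in full further down:
SQL> delete from accountrecord where accr_val_year >= 2003;
SQL> commit;
-- disconnect, then connect again with a fresh isql --
SQL> select count(*) from accountrecord; <-- this one never returns
)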
tracing the server process shows that it mostly does this:
read(0x4,0x82c7000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c5000,SEEK_SET) = 476860416 (0x1c6c5000)
read(0x4,0x82c6000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c6000,SEEK_SET) = 476864512 (0x1c6c6000)
read(0x4,0x82c5000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c7000,SEEK_SET) = 476868608 (0x1c6c7000)
read(0x4,0x82c4000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c8000,SEEK_SET) = 476872704 (0x1c6c8000)
read(0x4,0x82c3000,0x1000) = 4096 (0x1000)
lseek(4,0x1c6c9000,SEEK_SET) = 476876800 (0x1c6c9000)
read(0x4,0x82c2000,0x1000)
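(that is truss -p <server pid> output from the fbsd box, if i remember
the run correctly: just endless lseek/read pairs pulling in one 4k page
after the next)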
but physical disk i/o is clearly not the limit: vmstat shows more or
less zero physical disk activity, so everything must be coming out of
the os cache ..
procs memory page disk faults Cpu
r b w avm fre flt re pi po fr sr ad0 in sy cs us sy id
3 1 0 308564 42904 24 0 0 0 24 0 0 458 76309 1228 42 58 0
6 1 0 308564 42904 1 0 0 0 0 0 13 426 77517 1032 35 65 0
4 1 0 308564 42904 24 0 0 0 24 0 0 409 77459 942 32 68 0
5 1 0 308564 42904 0 0 0 0 0 0 0 405 77817 919 33 67 0
3 1 0 308564 42904 24 0 0 0 24 0 0 392 77466 891 26 74 0
7 1 0 308564 42904 0 0 0 0 0 0 0 448 77276 1096 41 59 0
4 1 0 308564 42888 24 0 0 0 24 0 0 451 77003 1160 36 64 0
5 1 0 308564 42888 0 0 0 0 0 0 3 409 77546 933 38 62 0
3 1 0 308564 42888 24 0 0 0 24 0 0 404 77355 938 38 6
it's completely cpu-bound (no wonder: 77000 syscalls/s surely come at a price)
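(quick back-of-the-envelope: ~77000 syscalls/s means ~38500 lseek/read
pairs, i.e. ~38500 4k pages per second, or roughly 150 MB/s out of the
os cache; at that rate the whole 440MB file gets cycled through about
every 3 seconds)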
just for the sake of it, i then tried (on a freshly restored backup) to delete
only some 20000 records from that table, and this time i do get a result ..
but it takes a whopping 880 seconds ... see:
SQL> delete from accountrecord where accr_id >= 580000 and accr_id <= 600000;
PLAN (ACCOUNTRECORD NATURAL)
Current memory = 688420
Delta memory = 68940
Max memory = 727492
Elapsed time= 8.59 sec
Cpu = 6.00 sec
Buffers = 75
Reads = 72822
Writes = 316
Fetches = 4792780
SQL> commit;
Current memory = 649112
Delta memory = -39308
Max memory = 727492
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 2
Fetches = 2
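(note that the delete itself is cheap: ~73k reads, essentially the same
single table scan as the plain count(*), and the commit is instant; the
pain only shows up afterwards)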
new client connects .... and runs the select count(*)
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
2278585
Current memory = 3984248
Delta memory = 8824
Max memory = 4038016
Elapsed time= 880.21 sec
Cpu = 260.32 sec
Buffers = 823
Reads = 32335520
Writes = 122096
Fetches = 37167441
whoa?! look at the stats! .. why is it doing 32 million
reads on a database that has just a bit over 100000 pages??
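(doing the math: the count dropped from 2308586 to 2278585, so 30001
rows were actually deleted, and 32335520 reads / 30001 rows is roughly
1080 page reads per deleted record; put differently, the ~119k page
file was read end to end about 270 times)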
for completeness, i then tried the original large delete
over the weekend ... and the results are .. well ..
let's just say "they are not impressive" ;-)
SQL> delete from accountrecord where accr_val_year >= 2003;
PLAN (ACCOUNTRECORD NATURAL)
Current memory = 2743632
Delta memory = 2133044
Max memory = 2743632
Elapsed time= 16.47 sec
Cpu = 12.82 sec
Buffers = 75
Reads = 72822
Writes = 54811
Fetches = 9932188
SQL> commit;
Current memory = 649112
Delta memory = -2094520
Max memory = 2749516
Elapsed time= 0.03 sec
Cpu = 0.00 sec
Buffers = 75
Reads = 2
Writes = 74
Fetches = 2
SQL> quit;
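(so again: deleting ~1.7 million records, 2308586 minus the 585449 that
remain below, took all of 16.5 seconds plus an instant commit; the
delete itself is not the problem)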
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> show database;
Database: /tmp/testdb.gdb
Owner: SYSDBA
PAGE_SIZE 4096
Number of DB pages allocated = 119033
Sweep interval = 20000
Forced Writes are OFF
Transaction - oldest = 113
Transaction - oldest active = 114
bash-3.00# isql /tmp/testdb.gdb
Database: /tmp/testdb.gdb
SQL> set stats;
SQL> set plan;
SQL> select count(*) from accountrecord;
PLAN (ACCOUNTRECORD NATURAL)
COUNT
============
585449
Current memory = 3916100
Delta memory = 3447500
Max memory = 3917268
Elapsed time= 51155.54 sec
Cpu = 16832.51 sec
Buffers = 819
Reads = 2139707744
Writes = 5929156
Fetches = -2128407134 <<<---- wtf?
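(that negative number looks like a signed 32-bit counter wrapping
around: -2128407134 + 2^32 = 2166560162, so presumably ~2.17 billion
fetches; the reads counter, 2139707744, is also sitting just below 2^31)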
imho, something is pretty wrong here; i can understand that the
database has to do some sort of garbage collection after large deletes,
but 51000+ seconds (that's over 14 hours) at 100% cpu the whole time?!?!
i mean ... within that time frame i could print that database
out on paper and do the select by hand ... :-)
any ideas/suggestions greatly appreciated ...
if you need more info please tell me, i'll try to provide whatever is
necessary .. we need to get this fixed, otherwise i fear we'll have to jump
ship, since waiting 14 hours for a simple select every once in a while is
clearly unacceptable ...
regards,
alex.