Subject Re: [firebird-support] Periodic database slowdown - troubleshooting steps?
Author Thomas Steinmaurer
Hi Bob,

> I have an FB 2.1.5 Classic server running on a Windows 2003 server,
> with a single hard drive for the operating system and a 3-disk RAID 5
> array for the database. We have one database on this machine, a
> dialect 1 database that was started on IB 6.0 many years ago,
> currently at 90GB. We have sweep disabled, and each night we run gbak,
> gfix -sweep, and reindex all tables via a script.
>
> The database has very little OLTP and is mostly used for reporting
> and serving web pages to internal business users. We do a lot of ETL
> starting very early each morning, and generate a mix of scheduled
> reports as well as letting users run pre-designed, parameterized
> reports in an ad-hoc manner.
>
> Once or twice per month, the system slows down tremendously. One ETL
> process typically runs at a pace of about 1000 records per 10 seconds.
> During these slow periods, the same ETL runs 1000 transactions per
> 60-80 seconds. When processing a file with 1mil+ records, this
> slowdown costs us hours.
>
> I have not been able to determine the reason for these slow periods.
> They do not coincide with higher CPU or disk usage; most of the time
> I'm seeing very little usage of anything (disk/CPU/network/memory). I
> do see more connections to the database during these periods:
> typically we have 10 to 15 connections, and the number may double
> during the problem times. This is because the reports that users are
> requesting take longer to run, and our pooled-connection application
> server or web server creates more connections to satisfy new user
> requests.
>
> I do see more queries running by the time I am notified of the
> problem. Again though, this is a consequence of the slowdown and not
> the cause: queries are taking longer to run, so we have more chance
> of overlapping user requests than usual.
>
> Most of the time we need to reboot once or even twice to fix the
> problem. That is not a viable long-term solution though, and I'm
> looking for more ideas to determine what may be happening. Any ideas
> would be most helpful.
>
> I have included gstat -h output taken while the database was
> suffering from the issue:
>
> Database header page information:
> Flags 0
> Checksum 12345
> Generation 43136192
> Page size 8192
> ODS version 11.1
> Oldest transaction 40789582
> Oldest active 41467442
> Oldest snapshot 41467442
> Next transaction 42431040
> Bumped transaction 1
> Sequence number 0
> Next attachment ID 705070
> Implementation ID 16
> Shadow count 0
> Page buffers 2048
> Next header page 0
> Database dialect 1
> Creation date May 2, 2009 22:22:39
> Attributes force write, no reserve
>
> Variable header data:
> Sweep interval: 0
> *END*

1) The most obvious thing in the header page is the very large gap
between the oldest active transaction (OAT) and the next transaction,
roughly 963,000 transactions. This means you have a long-running or
stuck transaction. If you are lucky, you can go into the
MON$TRANSACTIONS table and check whether you find a record with
MON$TRANSACTION_ID = 41467442. "Lucky", because I have seen occasions
where the OAT according to the header page isn't visible in the
monitoring tables. Perhaps some client (the ETL tool?) doesn't behave
well from a client-side transaction management POV.
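
If the transaction does show up, a query along these lines (run e.g.
in isql; 41467442 is the OAT taken from your header page above) should
point you to the attachment that owns it:

  -- find the attachment holding the suspected stuck transaction
  select t.mon$transaction_id, t.mon$timestamp, t.mon$state,
         a.mon$attachment_id, a.mon$user,
         a.mon$remote_address, a.mon$remote_process
    from mon$transactions t
    join mon$attachments a
      on a.mon$attachment_id = t.mon$attachment_id
   where t.mon$transaction_id = 41467442;

MON$REMOTE_ADDRESS and MON$REMOTE_PROCESS should then tell you which
machine and application to look at.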

2) Although you say you aren't in an OLTP pattern here, I guess, due
to the ETL, it isn't a read-only database, right? If so, running the
database in "no reserve" mode isn't a good idea, because you are
basically telling Firebird not to reserve space for back record
versions on the same data page as the primary record version. This
results in more reads from disk, especially in a reporting scenario
with long-running read-write transactions/queries, where concurrent
read/write requests generate a longer back version chain until it can
be removed via co-operative GC (the only GC mode in Classic). While
gfix can be used to switch off the "no reserve" flag, this doesn't
change the layout of already allocated data pages. If you have a
maintenance window, I would go with a backup/restore cycle to re-build
the database with "reserve" (the default, btw, so you don't have to
provide anything special for that) from scratch. That might be a
challenge for a 90GB database and a small maintenance window. A few
tricks to shorten the offline window (a command-line sketch follows
the list):

* Run both backup and restore through the Services API. When using
gbak, this can be done via the -service switch. This avoids going
through the TCP stack, which can improve performance a lot.

* Back up the database with the -g option, because this suppresses
garbage collection in the source database.

* If enough RAM is available, restore the database with a MUCH higher
page buffers value than 2048, because this can speed up index creation
during a restore a lot. E.g. 100000: with a page size of 8K, this
means ~800MB of RAM for the page cache of this single restore
connection only. Use it with caution and don't forget to set it back
to the original value after the restore!

* If you have a spare SSD, even if it is only a cheap consumer SSD,
make use of it for both backup and restore.
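
Putting the first three points together, a sketch of the offline cycle
(paths and credentials are placeholders, adjust them to your
environment):

  rem backup via the Services API (-se), suppressing GC (-g)
  gbak -b -g -se localhost:service_mgr E:\db\mydb.fdb E:\backup\mydb.fbk -user SYSDBA -password ********

  rem restore with a temporarily huge page cache (-bu) to speed up index creation
  gbak -c -bu 100000 -se localhost:service_mgr E:\backup\mydb.fbk E:\db\mydb_new.fdb -user SYSDBA -password ********

  rem reset the page cache to its regular value afterwards
  gfix -buffers 2048 E:\db\mydb_new.fdb -user SYSDBA -password ********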

3) As you are talking about reporting, make use of read-only
transactions. Even better would be a combination of read-only and
read committed isolation mode (since 2.0 such transactions are
pre-committed and don't hold up garbage collection), but read
committed might be problematic in a reporting scenario, when you need
a stable snapshot of the underlying data for the period of report
generation.
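
In plain DSQL terms (how you request this in practice depends on your
driver/component set):

  -- stable view of the data for the whole report run
  set transaction read only snapshot;

  -- doesn't hold up garbage collection, but sees data committed
  -- by others while the report is running
  set transaction read only read committed record_version;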

4) Keep an eye on the fb_lock_print output; if the hash lengths
(min/avg/max) reported in its header grow large, consider increasing
the default hash slots value via LockHashSlots in firebird.conf.
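
For example (2003 is just an illustrative value; prime numbers are
recommended, and the change takes effect after a restart):

  # firebird.conf
  # more hash slots = shorter hash chains under high lock traffic
  LockHashSlots = 2003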

5) Try to run gfix -sweep at a time when there is zero or close to
zero load.
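
E.g. as a scheduled task in your quietest slot (path and credentials
are placeholders again):

  gfix -sweep E:\db\mydb.fdb -user SYSDBA -password ********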


With 2.1.x you have the following troubleshooting options:

* Run gstat -h and watch the transaction counters for big gaps (a
quick one-liner for this follows below)
* Use the various MON$ tables to get a picture of what's currently
running in the database, including active transactions, I/O stats etc.
* Use fb_lock_print
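
For the first point, a quick way to eyeball the counters on Windows
(the database path is a placeholder):

  gstat -h E:\db\mydb.fdb | findstr "transaction active snapshot"

The interesting number is "Next transaction" minus "Oldest active".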

With 2.5, you would get another monitoring facility, namely the Trace
API. Highly recommended, so if you are able to run your stuff through
a 2.5 instance in a test environment, you could benefit from the Trace
API as well, to spot slow statements etc.


The performance area in Firebird is pretty interesting and could be a
topic for a stand-alone book dedicated to performance, a half/full-day
workshop, support on a commercial basis etc.


--
With regards,
Thomas Steinmaurer
http://www.upscene.com/