Subject: RE: [firebird-support] Periodic database slowdown - troubleshooting steps?
Author: Bob Murdoch
Thomas -

-----Original Message-----
From: firebird-support@yahoogroups.com
[mailto:firebird-support@yahoogroups.com] On Behalf Of Thomas Steinmaurer

>> 1): The most obvious thing according to the header page is a very
>> large gap between the oldest active transaction and the next
>> transaction. This means you have a long-running/stuck transaction.
>> If you are lucky, you can go into the MON$TRANSACTIONS table and
>> check out if you find the MON$TRANSACTION_ID for 41467442. "Lucky",
>> because I saw occasions where the OAT according to the header page
>> isn't available in the monitoring tables. Perhaps some client (ETL
>> tool?) doesn't behave well from a client transaction management POV.

No such luck - 42450558 is the earliest of the 29 records listed.
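
For anyone searching the archives later, a query along these lines over
the monitoring tables lists the live transactions and the attachments
holding them (standard MON$ columns only, nothing specific to our schema):

    SELECT t.MON$TRANSACTION_ID, t.MON$TIMESTAMP,
           a.MON$USER, a.MON$REMOTE_ADDRESS, a.MON$REMOTE_PROCESS
    FROM MON$TRANSACTIONS t
    JOIN MON$ATTACHMENTS a ON a.MON$ATTACHMENT_ID = t.MON$ATTACHMENT_ID
    ORDER BY t.MON$TRANSACTION_ID;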



>> 2): Although you say you aren't in an OLTP pattern here, I guess due
>> to ETL, it isn't a read-only database, right? If so, running the
>> database in "no reserve" mode isn't a good idea, because basically
>> you are telling Firebird not to reserve space for back record
>> versions on the same data page as the primary record version. This
>> results in more reads from disk, especially in a reporting scenario
>> where you have long-running read-write transactions/queries, where
>> concurrent read/write requests generate a longer back record chain
>> until it can be removed via co-operative GC (the only GC mode in CS).

I have definitely never set the "no reserve" option myself. I wonder if
it was a default on an earlier version of the server that just carried
over. I'll use gfix to switch the database back to "use reserve", which
should at least help with the tables that are emptied and overwritten
regularly.
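
If it helps anyone else reading this, the plan is essentially the
following (path and credentials are placeholders):

    gfix -use reserve /data/db.fdb -user SYSDBA -password *****

and then checking gstat -h afterwards to confirm that "no reserve" has
disappeared from the Attributes line.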



>> While gfix can be used to remove the "no reserve" thing, this
>> doesn't change the layout of already allocated data pages. If you
>> have a maintenance window, I would go with a backup/restore cycle to
>> re-build the database with "reserve" (the default, btw, thus you
>> don't have to provide anything special for that) from scratch. Might
>> be a challenge for a 90GB database and a small maintenance window.

That has been a problem for a very long time. Right now, a full
backup/restore cycle takes more than 24 hours, and at best we have a
12-hour window on a Sunday. Hence the May 2009 creation date of the
current DB.


>> A few tricks to shorten the offline window:

>> * Run both backup and restore through the services API. When using
>> gbak, this can be done via the -service switch. This results in not
>> going through the TCP stack, which can improve performance a lot.

That's a good trick, but since we are backing up to a separate server,
the gbak -b can't use the -service switch. Since we are restoring
locally on the second server I could use that switch there, but instead
we are using the embedded gbak. Using embedded is definitely faster than
regular gbak -c, but I'm curious whether -service is faster still. I
would assume that they are probably about the same.
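
For the archives, if I do test -service for the local restore, my
understanding is it would look roughly like this (paths and credentials
are placeholders):

    gbak -c -se service_mgr /backups/db.fbk /data/db.fdb -user SYSDBA -password *****

i.e. gbak hands the work to the local service manager instead of opening
its own connection to the database.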


>> * Backup the database with the -g option, because this suppresses
>> garbage collection in the source database.

This is standard practice when planning on replacing the database.
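
In other words, the backup line just gets the extra switch, roughly
(sourcehost and the paths being placeholders again):

    gbak -b -g sourcehost:/data/db.fdb /backups/db.fbk -user SYSDBA -password *****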


>> * If enough RAM is available, restore the database with a MUCH
>> higher page buffers value than 2048, because this can speed up index
>> creation during a restore a lot. E.g. 100000: with a page size of
>> 8K, this means ~800MB RAM for the page cache for this single restore
>> connection only. Use it with caution and don't forget to set it back
>> to the original value after the restore!!!

Good suggestion, I'm going to try that tonight.
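
If I understand the switches right, that would be a restore with -bu to
bump the page buffers, followed by a gfix afterwards to put the value
back (placeholder paths/credentials again, 2048 being the current value
Thomas referred to):

    gbak -c -bu 100000 /backups/db.fbk /data/db.fdb -user SYSDBA -password *****
    gfix -buffers 2048 /data/db.fdb -user SYSDBA -password *****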


>> * If you have a spare SSD, even if it is only a cheap consumer SSD,
>> make use of it for both backup and restore.

Unfortunately it's a corporate datacenter with fixed configurations,
so no goodies like SSDs.


>> 3:) As you are talking about reporting, make use of read-only
>> transactions. Even better would be a combination of read-only
>> transactions in read committed isolation mode, but read committed
>> might be problematic in a reporting scenario, when you need a stable
>> snapshot of the underlying data for the period of report generation.

Very good points!
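
For my own notes, in isql terms the two flavours would be roughly:

    SET TRANSACTION READ ONLY ISOLATION LEVEL READ COMMITTED;
    SET TRANSACTION READ ONLY ISOLATION LEVEL SNAPSHOT;

with snapshot being the one to use when a report needs a stable view
across several queries.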


>> 4:) Keep an eye on the fb_lock_print output to possibly increase
>> the default hash slot value.
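
(If I'm reading firebird.conf right, the hash slot value is the
LockHashSlots parameter - the comments there recommend a prime number -
so after checking the "Hash lengths (min/avg/max)" line that
fb_lock_print reports, it would be a change along the lines of:

    LockHashSlots = 30011

with the value above purely illustrative.)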

>> 5:) Try to run gfix -sweep at a time when there is zero or close to
>> zero load.

Yes, we run it at night just before the backup kicks off.
Unfortunately, there is overlap because the sweep usually takes about
2.5 hours.
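
(For completeness, the sweep itself is just the stock invocation,
something like:

    gfix -sweep /data/db.fdb -user SYSDBA -password *****

with the same placeholder path and credentials as in the examples above.)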


Thanks,

Bob M..