Subject Re: [firebird-support] Periodic database slowdown - troubleshooting steps?
Author Thomas Steinmaurer
Hello Bob,

>>> 1): The most obvious thing according to the header page is a very
>>> large gap between the oldest active transaction and the next
>>> transaction. This means you have a long-running/stuck transaction.
>>> If you are lucky, you can go into the MON$TRANSACTIONS table and
>>> check whether you can find MON$TRANSACTION_ID 41467442. "Lucky",
>>> because I have seen occasions where the OAT according to the header
>>> page isn't available in the monitoring tables. Perhaps some client
>>> (ETL tool?) doesn't behave well from a client transaction
>>> management POV.
>
> No such luck - 42450558 is the earliest of the 29 records listed.

I don't have an explanation for why the OAT isn't visible in the mon
tables, but it could be due to a crashed server or possibly a
transaction in limbo, in case you are using distributed transactions.
You can use gfix to check whether there are transactions in limbo.
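
For example, something like this should work (the path and
credentials are placeholders, of course):

  gfix -list -user SYSDBA -password masterkey yourserver:/path/to/yourdb.fdb

Any limbo transaction IDs reported there can then be resolved with
gfix -commit <id>, gfix -rollback <id>, or gfix -two_phase <id>.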

Other than that, a backup/restore will cure this situation.


>>> 2): Although you say you aren't in an OLTP pattern here, I guess
>>> due to ETL it isn't a read-only database, right? If so, running the
>>> database in "no reserve" mode isn't a good idea, because basically
>>> you are telling Firebird not to reserve space for back record
>>> versions on the same data page as the primary record version. This
>>> results in more reads from disk, especially in a reporting scenario
>>> where you have long-running read-write transactions/queries, where
>>> concurrent read/write requests generate a longer back record chain
>>> until it can be removed via co-operative GC (the only GC mode in
>>> CS).
>
> I have definitely never used the "no reserve" option. I wonder if it
> was a default on an earlier version of the server that just carried
> over. I'll use gfix to use reserve to at least deal with those tables
> that are emptied and overwritten regularly.

"No reserve" isn't, and never was, the default. It can be changed via
gfix, during a gbak restore, or even through client applications and
the services API. Are you using IBObjects by any chance?
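
To check the current setting and to switch it back, something along
these lines should do (placeholder path and credentials; gstat -h
shows "no reserve" in the Attributes line if it is set):

  gstat -h /path/to/yourdb.fdb
  gfix -use reserve -user SYSDBA -password masterkey /path/to/yourdb.fdb

As discussed below, this only affects data pages allocated from then
on.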


>>> While gfix can be used to remove the "no reserve" thing, this
>>> doesn't change the layout of already allocated data pages. If you
>>> have a maintenance window, I would go with a backup/restore cycle
>>> to re-build the database with "reserve" (the default, btw, thus you
>>> don't have to provide anything special for that) from scratch.
>>> Might be a challenge for a 90GB database and a small maintenance
>>> window.
>
> That has been a problem for a very long time. Right now, a full
> backup/restore cycle is taking more than 24 hours, and at best we
> only have a 12-hour window on a Sunday. Hence the May 2009 creation
> date of the current DB.

Ok. Another scenario/option could be:

* Disconnect all users from the database
* Install a trigger-based logging mechanism (I won't do a product plug
here *g*); a minimal sketch follows below
* Start the gbak backup
* Allow users to connect again

This way, all DML after the backup has been started will be captured
in the source database and can be re-applied to the restored database
at a later point. This could cut downtime a lot. Admittedly, I have
never done that in production, but it should work.
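
Just to illustrate the idea, a minimal sketch (assuming Firebird 2.1
or later and a hypothetical CUSTOMERS table with an ID primary key;
a real solution needs one trigger per table and enough detail to
actually replay the changes):

  CREATE TABLE DML_LOG (
    LOG_ID     BIGINT NOT NULL PRIMARY KEY,
    TABLE_NAME VARCHAR(63),
    OP         CHAR(1),
    PK_VALUE   VARCHAR(255),
    LOGGED_AT  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );

  CREATE SEQUENCE GEN_DML_LOG;

  SET TERM ^ ;
  CREATE TRIGGER TR_CUSTOMERS_LOG FOR CUSTOMERS
  AFTER INSERT OR UPDATE OR DELETE
  AS
  BEGIN
    /* Record which row was touched and how; in a multi-event trigger
       OLD/NEW are simply NULL when not applicable. */
    INSERT INTO DML_LOG (LOG_ID, TABLE_NAME, OP, PK_VALUE)
    VALUES (NEXT VALUE FOR GEN_DML_LOG, 'CUSTOMERS',
            CASE WHEN INSERTING THEN 'I'
                 WHEN UPDATING THEN 'U'
                 ELSE 'D' END,
            COALESCE(NEW.ID, OLD.ID));
  END^
  SET TERM ; ^

Re-applying is then a matter of walking DML_LOG in LOG_ID order
against the restored database.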


>>> A few tricks to shorten the offline window:
>
>>> * Run both backup and restore through the services API. When using
>>> gbak, this can be done via the -service switch. This results in
>>> not going through the TCP stack, which can improve performance a
>>> lot.
>
> That's a good trick, but since we are backing up to a separate server
> the gbak -b can't use the service switch. Since we are restoring
> locally on the second server I could use that switch, but instead we
> are using the embedded gbak. Using embedded is definitely faster than
> regular gbak -c, but I'm curious as to whether -service is faster. I
> would assume that they are probably about the same.

I don't have any concrete numbers, but to improve the backup process
you should back up to the local server, if that is possible, and then
move the backup file to the remote server.
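
Something along these lines, run directly on the database server
(paths and credentials are placeholders; -g is the garbage collection
switch from your next point):

  gbak -b -g -service localhost:service_mgr /path/to/yourdb.fdb /fast/local/disk/yourdb.fbk -user SYSDBA -password masterkey

and then copy yourdb.fbk to the second server, where the restore can
also go through the local services API:

  gbak -c -service localhost:service_mgr /path/to/yourdb.fbk /path/to/newdb.fdb -user SYSDBA -password masterkey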


>>> * Back up the database with the -g option, because this suppresses
>>> garbage collection in the source database
>
> This is standard practice when planning on replacing the database.
>
>
>>> * If enough RAM is available, restore the database with a MUCH
>>> higher page buffers value than 2048, because this can speed up
>>> index creation during a restore a lot. E.g. 100000: with a page
>>> size of 8K, this means ~800MB RAM for the page cache for this
>>> single restore connection only. Use it with caution and don't
>>> forget to set it to the original value after the restore!!!
>
> Good suggestion, I'm going to try that tonight.
>
>
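
For what it's worth, that could look roughly like this (placeholders
again; check the current buffers value with gstat -h first):

  gbak -c -buffers 100000 -service localhost:service_mgr /path/to/yourdb.fbk /path/to/newdb.fdb -user SYSDBA -password masterkey
  gfix -buffers 2048 -user SYSDBA -password masterkey /path/to/newdb.fdb

The second command puts the page buffers back to a sane value once
the restore has finished.
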
>>> * If you have a spare SSD, even if it is only a cheap consumer
>>> SSD, make use of it for both backup and restore.
>
> Unfortunately it's a corporate datacenter with fixed configurations,
> so no goodies like SSDs.
>
>
>>> 3:) As you are talking about reporting, make use of read-only
>>> transactions. Even better would be a combination of read-only
>>> transactions in read committed isolation mode, but read committed
>>> might be problematic in a reporting scenario, when you need a
>>> stable snapshot of the underlying data for the period of report
>>> generation.
>
> Very good points!
>
>
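
In case it helps, in plain SQL/isql that would be something like
(just a sketch):

  SET TRANSACTION READ ONLY READ COMMITTED RECORD_VERSION;

or, if the report needs a stable view of the data for its whole run:

  SET TRANSACTION READ ONLY ISOLATION LEVEL SNAPSHOT;

Most connection libraries expose the same things as transaction
parameter block flags (isc_tpb_read, isc_tpb_read_committed,
isc_tpb_concurrency).
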
>>> 4:) Keep an eye on the fb_lock_print output to possibly increase
>>> the default hash slot value.
>
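
(For reference: the interesting bit in the fb_lock_print header is
the "Hash lengths (min/avg/max)" line; the slot count itself is set
via firebird.conf, e.g.

  # firebird.conf; an example value only, it should be a prime number
  LockHashSlots = 8191

and takes effect on the next start of the lock manager.)
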
>>> 5:) Try to run gfix -sweep at a time when there is zero or close
>>> to zero load.
>
> Yes, we run it at night just before the backup kicks off.
> Unfortunately, there is overlap because the sweep usually takes about
> 2.5 hours.

That overlap could be the reason why the sweep isn't that efficient.
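
If the sweep and the backup keep overlapping, one option (a sketch,
placeholders as usual) is to disable the automatic sweep interval and
only sweep explicitly in the quiet window:

  gfix -housekeeping 0 -user SYSDBA -password masterkey /path/to/yourdb.fdb
  gfix -sweep -user SYSDBA -password masterkey /path/to/yourdb.fdb

If you go that route, keep an eye on the OIT/OAT gap in gstat -h.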



--
With regards,
Thomas Steinmaurer
http://www.upscene.com/