Subject: Re: [firebird-support] Periodic database slowdown - troubleshooting steps?
Author: Thomas Steinmaurer
Post date: 2012-09-18T06:28:40Z

Hello Bob,

>>> 1): The most obvious thing according to the header page is a very large
>>> gap between the oldest active transaction and the next transaction. This
>>> means, you have a long-running/stuck transaction. If you are lucky, you
>>> can go into the MON$TRANSACTIONS table and check out if you find the
>>> MON$TRANSACTION_ID for 41467442. "Lucky", because I saw occasions where
>>> the OAT according to the header page isn't available in the monitoring
>>> tables. Perhaps some client (ETL tool?) doesn't behave well from a
>>> client transaction management POV.
>
> No such luck - 42450558 is the earliest of the 29 records listed.

Haven't got an explanation why OAT isn't visible in the mon tables, but
it could be due to a crashed server or possibly a transaction in limbo,
in case you are using distributed transactions. You can use gfix to
check if there are transactions in limbo.
Other than that, a backup/restore will cure this situation.
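
If it helps, this is roughly how I would check both points from the
command line (SYSDBA/masterkey and the paths are placeholders, adjust to
your environment):

  /* in isql: which transactions are really still open, and who owns them */
  SELECT t.MON$TRANSACTION_ID, t.MON$STATE, t.MON$TIMESTAMP,
         a.MON$USER, a.MON$REMOTE_PROCESS
  FROM MON$TRANSACTIONS t
  JOIN MON$ATTACHMENTS a ON a.MON$ATTACHMENT_ID = t.MON$ATTACHMENT_ID
  ORDER BY t.MON$TRANSACTION_ID;

  # list limbo transactions; gfix -commit <id> / -rollback <id> resolves them
  gfix -list -user SYSDBA -password masterkey yourserver:/data/yourdb.fdb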

>>> 2): Although you say you aren't in an OLTP pattern here, I guess due to
>>> ETL, it isn't a read-only database, right? If so, running the database
>>> in "no reserve" mode isn't a good idea, because, basically, you are
>>> telling Firebird to not reserve space for back record versions on the
>>> same data page as the primary record version. This results in more reads
>>> from disk, especially in a reporting scenario where you have
>>> long-running read-write transactions/queries, where concurrent
>>> read/write requests generate a longer back record chain until it can be
>>> removed via co-operative GC (the only GC mode in CS).
>
> I have definitely never used the "no reserve" option. I wonder if it
> was a default on an earlier version of the server that just carried
> over. I'll use gfix to use reserve to at least deal with those tables
> that are emptied and overwritten regularly.

"No reserve" isn't and never was the default. It can be changed via
gfix, during a gbak restore or even through client applications and the
services API. Are you using IBObjects by any chance?
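
Just as a sketch (placeholder credentials/paths again), switching an
existing database back and verifying it on the header page:

  # switch the database back to "reserve"; only newly allocated pages benefit
  gfix -use reserve -user SYSDBA -password masterkey /data/yourdb.fdb

  # verify: "no reserve" should no longer appear on the Attributes line
  gstat -h -user SYSDBA -password masterkey /data/yourdb.fdb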

>>> While gfix can be used to remove the "no reserve" thing, this doesn't
>>> change the layout of already allocated data pages. If you have a
>>> maintenance window, I would go with a backup/restore cycle to re-build
>>> the database with "reserve" (the default, btw, thus you don't have to
>>> provide anything special for that) from scratch. Might be a challenge
>>> for a 90GB database and a small maintenance window.
>
> That has been a problem for a very long time. Right now, a full
> backup/restore cycle is taking more than 24 hours, and at best we only
> have a 12 hour window on a Sunday. Hence the May 2009 creation date of
> the current DB.

Ok. Another scenario/option could be:
* Disconnect all users from the database
* Install a trigger-based logging mechanism (I won't do a product plug
here *g*)
* Start the gbak backup
* Allow users to connect again
This way, all DML after the backup has been started will be captured in
the source database and can be re-applied on the restored database at a
later point. This could cut downtime a lot. Although I have never done
that in production, it should work.
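
The logging part doesn't have to be fancy. A very stripped-down sketch of
the idea, one trigger per table (DML_LOG, SOME_TABLE and its ID column are
placeholders, and this logs only keys and the action, not the values):

  CREATE TABLE DML_LOG (
    LOG_ID     BIGINT NOT NULL PRIMARY KEY,
    TABLE_NAME VARCHAR(31),
    ROW_ID     BIGINT,
    ACTION     CHAR(1),          /* I / U / D */
    LOGGED_AT  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );
  CREATE GENERATOR GEN_DML_LOG;

  SET TERM ^ ;
  CREATE TRIGGER SOME_TABLE_LOG FOR SOME_TABLE
  ACTIVE AFTER INSERT OR UPDATE OR DELETE POSITION 100
  AS
  BEGIN
    IF (DELETING) THEN
      INSERT INTO DML_LOG (LOG_ID, TABLE_NAME, ROW_ID, ACTION)
      VALUES (GEN_ID(GEN_DML_LOG, 1), 'SOME_TABLE', OLD.ID, 'D');
    ELSE
      INSERT INTO DML_LOG (LOG_ID, TABLE_NAME, ROW_ID, ACTION)
      VALUES (GEN_ID(GEN_DML_LOG, 1), 'SOME_TABLE', NEW.ID,
              CASE WHEN INSERTING THEN 'I' ELSE 'U' END);
  END^
  SET TERM ; ^

Re-applying on the restored database then means fetching the changed rows
by key from the source database once the backup has finished.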

>>> A few tricks to shorten the offline window:
>>>
>>> * Run both, backup and restore through the services API. When using
>>> gbak, this can be done via the -service switch. This results in not
>>> going through the TCP stack, which can improve performance a lot.
>
> That's a good trick, but since we are backing up to a separate server
> the gbak -b can't use the service switch. Since we are restoring
> locally on the second server I could use that switch, but instead we
> are using the embedded gbak. Using embedded is definitely faster than
> regular gbak -c, but I'm curious as to whether -service is faster. I
> would assume that they are probably about the same.

Don't have any concrete numbers, but to improve the backup process, you
should back up to the local server, if possible, and then move the
backup file to the remote server.
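
Roughly like this (paths and credentials are placeholders):

  # run the backup on the database host itself, through the Services API
  # (-se switch), writing the backup file locally
  gbak -b -se localhost:service_mgr /data/yourdb.fdb /backup/yourdb.fbk -user SYSDBA -password masterkey

  # then transfer /backup/yourdb.fbk to the second server (scp, robocopy,
  # whatever fits) and run the restore there locally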

>>> * Backup the database with the -g option, because this suppresses
>>> garbage collection in the source database
>
> This is standard practice when planning on replacing the database.

Could be the reason why the sweep isn't that efficient.

>>> * If enough RAM is available, restore the database with a MUCH higher
>>> page buffers value than 2048, because this can speed up index creation
>>> during a restore a lot. E.g. 100000, with a page size of 8K, this means
>>> ~800MB RAM for the page cache for this single restore connection only.
>>> Use it with caution and don't forget to set it to the original value
>>> after the restore!!!
>
> Good suggestion, I'm going to try that tonight.
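
In gbak terms, something like this (placeholders again; as far as I
remember, the -bu value ends up in the database header, hence the gfix
afterwards to put the normal value back):

  # restore with a big page cache just for the restore run
  gbak -c -bu 100000 -se localhost:service_mgr /restore/yourdb.fbk /data/yourdb.fdb -user SYSDBA -password masterkey

  # once the restore is through, reset page buffers to the usual value
  gfix -buffers 2048 -user SYSDBA -password masterkey /data/yourdb.fdb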

>>> * If you have a spare SSD, even if it is only a cheap consumer SSD,
>>> make use of it for both, backup and restore.
>
> Unfortunately it's a corporate datacenter with fixed configurations,
> so no goodies like SSDs.

>>> 3:) As you are talking about reporting, make use of read-only
>>> transactions. Even better would be a combination of read-only
>>> transactions in read committed isolation mode, but read committed might
>>> be problematic in a reporting scenario, when you need a stable snapshot
>>> of the underlying data for the period of report generation.
>
> Very good points!
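
In isql terms, the two flavours look like this (the read-only read
committed variant is the one that, as far as I know, doesn't hold up the
OAT while a report is running):

  /* stable snapshot for the whole report run */
  SET TRANSACTION READ ONLY ISOLATION LEVEL SNAPSHOT;

  /* read-only read committed, if the report can live without a stable view */
  SET TRANSACTION READ ONLY ISOLATION LEVEL READ COMMITTED RECORD_VERSION;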

>>> 4:) Keep an eye on the fb_lock_print output to possibly increase the
>>> default hash slot value.
>>>
>>> 5:) Try to run gfix -sweep at a time, when there is zero or close to
>>> zero load.
>
> Yes, we run it at night just before the backup kicks off.
> Unfortunately, there is overlap because the sweep usually takes about
> 2.5 hours.
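
For 4), the knob is LockHashSlots in firebird.conf; if the hash lengths
(min/avg/max) reported by fb_lock_print keep growing, raise it to a larger
prime (10007 below is just an example value, and the change only takes
effect once the lock table is re-created, i.e. after a restart, or for
Classic once all connections are gone):

  # firebird.conf
  LockHashSlots = 10007

  # 5) the sweep itself, e.g. scheduled in the quiet window before the backup
  gfix -sweep -user SYSDBA -password masterkey /data/yourdb.fdb
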
--
With regards,
Thomas Steinmaurer
http://www.upscene.com/