Subject after 2.1 to 2.5 migration, nbackup seems to slow operations unusually
Author unordained
Last night, we upgraded:

SS 32-bit 2.1.5 -> SS 32-bit 2.5.2(sec1), on 64-bit Windows Server 2008

For the most part, things are okay, but while nbackup is running (level-2 backups
every hour, each taking about 10 minutes), I'm seeing something I haven't seen
before: concurrent queries take enormously longer than usual to complete. I'm used
to a small bump in the average with no significant outliers, but now operations
that used to take maybe 200ms extra during backups hang for as long as 4 or 5
minutes. Once the backup is finished, everything goes back to normal. It doesn't
seem to force queries to wait until the backup is finished; it just delays them
a lot. (It also doesn't seem to be tied to the moment the delta file is merged
back in, which I've noticed before does tend to nearly freeze everything.) I see
slowness in getting connections, starting transactions, reads, writes... it's
not just one aspect.

(We took care of the quota-upping permission late this morning and bounced the
service, to fix the filesystem cache thrashing we saw for the first few rounds
of backups -- but while that significantly reduced CPU/RAM load during the
backups, it didn't fully resolve the concurrent-query issues.)

The database size is still pretty much the same, and the backup still takes the
same amount of time to run; the level-0 and level-1 files are recent (from right
after the backup/restore to upgrade the ODS), so the level-2 files are about the
size they usually are. So it doesn't look like nbackup is simply doing its job
faster than before at the cost of more side effects.

We reset our firebird.conf settings to the defaults as part of the upgrade (we
couldn't justify the tweaks we had made, such as raising hash slots to 5099 and
increasing the cache size and lock table size).

But I'm unsure what setting might be involved in this case, and while I'm
willing to spend the next week tweaking one setting each day, I was hoping
someone might have a clue. I didn't see any JIRA items that would explain it as
a bug.

The lock stats [taken during nbackup] look fairly reasonable to me, but I can't
find hard guidelines for interpreting them:

LOCK_HEADER BLOCK
Version: 17, Active owner: 0, Length: 2097152, Used: 1638408
Flags: 0x0001
Enqs: 1693135, Converts: 102884, Rejects: 261, Blocks: 0
Deadlock scans: 0, Deadlocks: 0, Scan interval: 10
Acquires: 3869863, Acquire blocks: 0, Spin count: 0
Mutex wait: 0.0%
Hash slots: 1009, Hash lengths (min/avg/max): 5/ 10/ 20
Remove node: 0, Insert queue: 0, Insert prior: 0
Owners (17): forward: 18744, backward: 1294644
Free owners (60): forward: 1298964, backward: 1298132
Free locks (172): forward: 1292740, backward: 1405480
Free requests (2688): forward: 1474112, backward: 1463932
Lock Ordering: Enabled
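
To put those numbers in context, here is a minimal Python sketch (mine, not part
of any Firebird tooling) that pulls a few counters out of a dump like the one
above and turns them into ratios. The function names `parse_lock_header` and
`summarize`, and the choice of which ratios to compute, are my own assumptions;
I'm not aware of official thresholds for any of them.

```python
import re

# Counters we care about, spelled exactly as they appear in the dump.
FIELDS = ("Length", "Used", "Enqs", "Rejects",
          "Acquires", "Acquire blocks", "Hash slots")

SAMPLE = """\
LOCK_HEADER BLOCK
Version: 17, Active owner: 0, Length: 2097152, Used: 1638408
Flags: 0x0001
Enqs: 1693135, Converts: 102884, Rejects: 261, Blocks: 0
Deadlock scans: 0, Deadlocks: 0, Scan interval: 10
Acquires: 3869863, Acquire blocks: 0, Spin count: 0
Mutex wait: 0.0%
Hash slots: 1009, Hash lengths (min/avg/max): 5/ 10/ 20
"""


def parse_lock_header(text):
    """Extract the selected integer counters from a LOCK_HEADER dump."""
    stats = {}
    for field in FIELDS:
        match = re.search(r"\b" + re.escape(field) + r":\s*(\d+)", text)
        if match:
            stats[field] = int(match.group(1))
    # Hash chain lengths are printed as "min/ avg/ max" on one line.
    match = re.search(
        r"Hash lengths \(min/avg/max\):\s*(\d+)/\s*(\d+)/\s*(\d+)", text)
    if match:
        hash_min, hash_avg, hash_max = (int(g) for g in match.groups())
        stats.update({"Hash min": hash_min, "Hash avg": hash_avg,
                      "Hash max": hash_max})
    return stats


def pct(part, whole):
    """Percentage of part in whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize(stats):
    """Turn raw counters into ratios (my own choice of ratios, not official)."""
    return {
        "table_used_pct": pct(stats["Used"], stats["Length"]),
        "reject_pct": pct(stats["Rejects"], stats["Enqs"]),
        "acquire_block_pct": pct(stats["Acquire blocks"], stats["Acquires"]),
        "hash_avg_chain": stats.get("Hash avg"),
    }


stats = parse_lock_header(SAMPLE)
summary = summarize(stats)

if __name__ == "__main__":
    for name, value in summary.items():
        print(f"{name}: {value}")
```

On the dump above this gives a lock table that is roughly 78% used, rejects at
around 0.015% of enqueues, no acquire blocks, and an average hash chain length
of 10.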

What would you guys try changing first? What would you recommend I look at?

(We have seen other improvements in 2.5 that we are VERY happy with. CORE-1477
was the main reason for upgrading, and our nightly jobs no longer crash the
server at 2GB, and only take half as long as before. Yay! Thank you!)

-Philip