Subject: Re: [firebird-support] [FB 2.1] Firebird engine seems to slow down on high load without utilizing hardware
Author: Thomas Steinmaurer
Hi Patrick,

> Hi Thomas, nice to get a response from you. We already met in ~2010 in Linz at
> your office :)
> (ex. SEM GmbH, later Playmonitor GmbH)

I know. XING (Big Brother) is watching you. Nice to see that you are still running with Firebird. ;-)


> First, sorry for posting a mixed state of information. The config settings I
> posted are the current settings.
> But the Lock-Table-Header was from last Saturday (day of the total system crash) -
> we changed the Hash Slots value since then, but it didn't help. The new table looks
> like:
>
>
> LOCK_HEADER BLOCK
> Version: 16, Active owner: 0, Length: 134247728, Used: 55790260
> Semmask: 0x0, Flags: 0x0001
> Enqs: 1806423519, Converts: 4553851, Rejects: 5134185, Blocks: 56585419
> Deadlock scans: 82, Deadlocks: 0, Scan interval: 10
> Acquires: 2058846891, Acquire blocks: 321584126, Spin count: 0
> Mutex wait: 15.6%
> Hash slots: 20011, Hash lengths (min/avg/max): 0/ 7/ 18
> Remove node: 0, Insert queue: 0, Insert prior: 0
> Owners (297): forward: 385160, backward: 38086352
> Free owners (43): forward: 52978748, backward: 20505128
> Free locks (41802): forward: 180712, backward: 3620136
> Free requests (-1097572396): forward: 46948676, backward: 13681252
> Lock Ordering: Enabled
>
>
> The Min/Avg/Max hash lengths look better now, but as you mentioned, the Mutex
> wait is worrying us too.
> We have two direct questions about that.
>
>
> 1) What are the negative effects of increasing Hash-Slots (too high)?

It defines the number of slots of the hash table which is used to look up a lock(ed) object by its key (= hash value), ideally with constant O(1) run-time complexity. If the table has too few slots, each slot degenerates into a linked/linear list that has to be traversed, which in the worst case means O(n) lookups. With the value of 20011 above, the AVG hash length looks fine.
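If it helps to picture the effect, here is a minimal sketch in Python (purely illustrative, not Firebird code; the lock count is a made-up figure). It spreads random keys over a given number of hash slots and reports the resulting min/avg/max chain lengths, i.e. the same kind of numbers as the "Hash lengths" line above:

import random
from collections import defaultdict

def chain_stats(num_locks, hash_slots):
    """Distribute num_locks random keys over hash_slots buckets and
    return (min, avg, max) chain length per occupied slot."""
    chains = defaultdict(int)
    for _ in range(num_locks):
        chains[random.getrandbits(32) % hash_slots] += 1
    lengths = list(chains.values())
    return min(lengths), sum(lengths) / len(lengths), max(lengths)

# Made-up figure: ~150000 locked objects.
print(chain_stats(150000, 101))    # few slots  -> long chains, lookups tend towards O(n)
print(chain_stats(150000, 20011))  # many slots -> short chains, lookups stay near O(1)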

As you might know, Classic, with its dedicated process-per-connection model, needs a global mechanism to synchronize/protect shared data structures across these processes via IPC. This is what the lock manager and the lock table are used for.
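As a loose analogy only (plain Python, nothing Firebird specific): a couple of worker processes that funnel every update to a shared structure through one global lock behave much like Classic processes serializing access to the lock table, and the time they spend waiting on that lock is what shows up as mutex wait:

from multiprocessing import Lock, Process, Value

def worker(lock, shared_counter):
    for _ in range(10000):
        with lock:                      # one global mutex guards the shared structure
            shared_counter.value += 1   # the "lock table" in this analogy

if __name__ == "__main__":
    lock, counter = Lock(), Value("i", 0)
    procs = [Process(target=worker, args=(lock, counter)) for _ in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)                # 40000; contention on 'lock' shows up as wait time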

> 2) As far as we know, we can't influence Mutex wait directly (it's just
> informational). But do you think that's the reason the underlying hardware is
> not utilized?

I don't think you are disk IO bound, i.e. I'm not convinced that faster IO will help. This is somewhat backed by the high mutex wait. Under normal operations you see 100-500 IOPS with room for a further increase, as the 1700 IOPS backup use case shows. I don't know how random the disk IO is in these two scenarios. Any chance to run some sort of disk IO benchmark, or do you already know the upper IOPS limit of your SAN?
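If you want a quick ballpark figure before setting up a proper benchmark, a rough sketch like the following could do (file path and block size are assumptions; the OS page cache is not bypassed here, so a dedicated benchmark tool doing direct IO against the SAN will give far more trustworthy numbers):

import os, random, time

TEST_FILE = "/path/to/large_test_file"   # hypothetical: several GB, ideally on the SAN
BLOCK_SIZE = 8192                        # roughly a Firebird page
DURATION = 10                            # seconds

fd = os.open(TEST_FILE, os.O_RDONLY)
size = os.fstat(fd).st_size
ops, deadline = 0, time.time() + DURATION
while time.time() < deadline:
    offset = random.randrange(0, size - BLOCK_SIZE)
    os.pread(fd, BLOCK_SIZE, offset)     # one random 8 KB read
    ops += 1
os.close(fd)
print("approx. %d random-read IOPS (page cache not bypassed)" % (ops // DURATION))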

>
>
> We do consider upgrading to 2.5, but have had our eyes on FB 3 over the last year,
> waiting for it to become ready.
> We have tested with 2.5.x for a long time now, but never found a real reason to
> upgrade - since it's a considerable amount of work for us. When you say it
> improves lock contention, that sounds pretty good. But again the question:
> do you think lock contention is limiting our system?

Dmitry, Vlad etc. will correct me (in case they are following the thread), but I recall that 2.5, especially SuperClassic, which is multi-threaded within a single worker process compared to Classic, now also allows certain lock manager operations to run in parallel with regular request processing. In general, I remember a reported improvement of ~25% in a TPC-C style workload with SuperClassic compared to Classic.

>
>
> First and foremost, we would really like to find the bottleneck. We just don't
> have the know-how to imagine something like "Fb 2.1 Engine is limiting us
> because of ..." and without that knowledge it's hard to take actions like
> upgrading to 2.5.
>
>
> We'll try to collect information about the garbage we create :) We do run
> "Sinatica Monitoring" on the server, which shows us "Awaiting Garbage
> Collection" transactions. Is that the information you're looking for?

I'm not familiar with Sinatica. Perhaps the periodic MON$ queries (how frequently does Sinatica execute them?) also produce some sort of overhead, because each MON$ table query in the context of a new physical transaction results in a stable view of the current activity being built. That is possibly not negligible with > 400 connections.

The easiest ways to get insight into your record garbage are, e.g.:

* Run gstat -r
* Run a tool from IBSurgeon (can't recall the name, Alexey?)
* Run a tool from Upscene (FB TraceManager)
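As a complement to the tools above, keeping an eye on the transaction markers also tells you whether garbage can pile up at all: a steadily growing gap between the oldest active and the next transaction is the usual culprit. A small sketch using the fdb Python driver (driver choice, DSN and credentials are assumptions, adjust to your environment):

import fdb   # assumption: the fdb driver is installed and can reach the server

con = fdb.connect(dsn="yourserver:/path/to/database.fdb",   # hypothetical DSN
                  user="SYSDBA", password="masterkey")      # hypothetical credentials
cur = con.cursor()
cur.execute("SELECT MON$OLDEST_TRANSACTION, MON$OLDEST_ACTIVE, MON$NEXT_TRANSACTION "
            "FROM MON$DATABASE")
oit, oat, next_tx = cur.fetchone()
print("OIT=%d  OAT=%d  Next=%d  gap=%d" % (oit, oat, next_tx, next_tx - oat))
con.close()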

>
> Maybe to avoid confusion: we don't have normal "spikes"... the system just
> starts to slow down, and this state remains until the server load is gone (after
> midnight, when the software is not used anymore).



--
With regards,
Thomas Steinmaurer
http://www.upscene.com

Professional Tools and Services for Firebird
FB TraceManager, IB LogManager, Database Health Check, Tuning etc.