firebird-architect - Re: [Firebird-Architect] FW: Deadlocks on FB1.5

Subject	Re: [Firebird-Architect] FW: Deadlocks on FB1.5
Author	Nickolay Samofatov
Post date	2003-05-14T10:49Z

Hello, Claudio,

> Nickolay Samofatov wrote in fb-devel:
>>
>> Firebird 1.0 and earlier versions handled internal page-level deadlocks
>> internally in a very interesting way. In case of internal deadlocks it
>> stopped _ALL_ contending processes for 10 seconds (DeadlockTimeout
>> parameter value) and then retried several times (possibly waiting
>> again) before reporting internal deadlock to client. This caused
>> server to collapse under load. For Firebird 1.0 this looked like
>> this: under heavy load engine may enter state when clients have
>> response time about 10-20 seconds for every query and server CPU's
>> are not loaded at all.
>>
>> In Firebird 1.5 I changed the behaviour to report all deadlocks
>> (including internal ones) instantly, without any timeouts.
>> This change allows Firebird to be used under constant heavy load with
>> a predictable response time.

> Nickolay, can you tell me when was this issue discussed before being
> changed? I decided to promote the discussion here because it seems an
> architecture decision, not a simple code change. And I only became fully
> aware of the change when an app developer reported a new behavior, even if I
> read all the changes to the CVS.

This is not a big issue (a matter of 15-20 lines of code). The problem is that
it revealed a couple of unstable places in the Firebird code, and they also need
to be fixed.

> This is a low level behavior that -good or bad- has been there for years and
> changing it should fit the needs of most people, not only one developer.

It will benefit all the developers.

> - Is your reasoning/solution valid for all environments? Or only in the case
> of several clients smashing the engine with quick and brief requests?

It is valid for all environments.

> - Do all kinds of application get a benefit by being informed of ALL
> deadlocks, as you wrote?

Due to a few design issues in the VIO code, internal deadlocks are always
possible. My applications got them many times, but before
reporting them the Firebird server became essentially unconscious.

After my VIO timings fix, the probability of an application getting an
internal deadlock will be the same as in Firebird 1.0.

> - What's the performance penalty for scanning immediately for deadlocks
> instead of waiting some seconds like it was before?

Almost none. The Firebird lock manager doesn't support asynchronous
waiting on locks, so the maximum number of blocking locks is <= the
number of connections (plus possibly a few for AST and GC activity).
Only blocking locks are scanned for deadlock cycles.
In fact, things are a little more complicated because several owners
may exist for a single resource, but anyway scanning for deadlocks is
usually a matter of checking a handful of nodes (up to a few hundred in
theory).
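
As an illustration of the kind of scan described above, here is a minimal
sketch in Python. It is not the Firebird lock manager code: the function name
find_deadlock and the waits_for mapping are hypothetical, and the sketch only
assumes that each blocked owner waits on one resource that may be held by
several owners.

```python
# Hypothetical sketch of a deadlock scan over blocked owners only.
# This is NOT Firebird's lock manager; names and data layout are assumptions.

def find_deadlock(waits_for, start):
    """Return the owners on a wait cycle that leads back to `start`, or None.

    waits_for maps each blocked owner to the owners currently holding the
    resource it waits on (several owners may hold a single resource).
    """
    visited = set()   # owners already explored; never explored twice
    path = [start]    # current chain of "waits for" edges from `start`

    def visit(owner):
        for holder in waits_for.get(owner, ()):
            if holder == start:
                # The chain has come back to the waiter: deadlock cycle.
                return list(path)
            if holder in visited:
                continue
            visited.add(holder)
            path.append(holder)
            cycle = visit(holder)
            if cycle is not None:
                return cycle
            path.pop()
        return None

    return visit(start)
```

For example, with waits_for = {"A": ["B"], "B": ["C"], "C": ["A"]}, the call
find_deadlock(waits_for, "A") returns ["A", "B", "C"]. Each blocked owner is
visited at most once, so the cost is bounded by the number of blocked owners
and their wait edges, which matches the "handful of nodes" estimate.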

On the SS architecture the only penalty is an extra deadlock scan
before going to sleep waiting for a lock (because ASTs are delivered
synchronously). On CS, one extra signal is issued to initiate a deadlock
scan once all blocking owners have processed their AST notifications.

> - What are the internal deadlocks that are now passed to the application?
> Isn't this overkill? Most deadlocks may be resolved quickly with an internal
> "let's retry" instead of bothering the poor application.

The problem is that my change altered the lock manager timings,
and this made one race condition in the VIO code appear more often than
before (and another old error, in the transaction rollback code, produces
BUGCHECK(290) instead of a normal transaction rollback).
I will fix both problems ASAP. A simple solution would be to introduce an
artificial delay into the internal page locking retry loop, but I also want
to make the transaction rollback code a little more robust.
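
The "artificial delay in the retry loop" idea can be sketched as follows.
This is only an illustration in Python, not the VIO code: the exception
InternalDeadlock, the function lock_with_retry, and all parameter values
(retry count, base delay, jitter range) are assumptions made for the example.

```python
import random
import time


class InternalDeadlock(Exception):
    """Hypothetical stand-in for an internal page-level deadlock."""


def lock_with_retry(try_lock, max_retries=5, base_delay=0.001, sleep=time.sleep):
    """Call try_lock(); on InternalDeadlock, wait briefly and retry.

    The delay grows with each attempt and is randomized so that contending
    owners are less likely to retry in lockstep. After max_retries failed
    retries the deadlock is re-raised to the caller.
    """
    for attempt in range(max_retries + 1):
        try:
            return try_lock()
        except InternalDeadlock:
            if attempt == max_retries:
                raise  # give up and report the deadlock
            sleep(base_delay * (attempt + 1) * random.uniform(0.5, 1.5))
```

The delays here are in the millisecond range, in contrast to the 10-second
DeadlockTimeout wait of the old behaviour, so the retry stays invisible to
clients unless the deadlock persists.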

The VIO page locking code needs review after the 1.5 release because it
currently produces internal deadlocks very often (and has done so since
InterBase 6.0 at least).

> Should this behavior be configurable instead?

Probably not, because as soon as I fix all the side effects that appeared,
no one will notice any difference except faster deadlock reporting.

Nickolay Samofatov