Subject	Re[2]: [Firebird-Architect] FW: Deadlocks on FB1.5
Author	Nickolay Samofatov
Post date	2003-05-14T15:58:25Z

Hello, Jim,

> At 04:17 AM 5/14/03 -0400, Claudio Valderrama C. wrote:
>>Nickolay Samofatov wrote in fb-devel:
>> >
>> >
>> > In Firebird 1.5 I changed the behaviour to report all deadlocks
>> > (including internal ones) instantly, without any timeouts.
>> > This change allows to use firebird under constant heavy load and have
>> > a predictable response time.
>>
>>
>>Nickolay, can you tell me when was this issue discussed before being
>>changed? I decided to promote the discussion here because it seems an
>>architecture decision, not a simple code change. And I only became fully
>>aware of the change when an app developer reported a new behavior, even if I
>>read all the changes to the CVS.

> The problem here is that in classic, not all deadlocks are actually deadlocks.
> In most cases, delivery of a blocking AST will cause a lock to be downgraded,
> clearing the deadlock. But for this to happen, the process holding the lock
> must receive and process the AST. In a heavily loaded system any number
> of things can delay the process.

1. A deadlock is not declared to be a deadlock until all blocking owners
have received ASTs (there is a special flag for that: OWN_signaled).
Only then does the wait cycle (10 seconds by default) begin.
2. My code begins the deadlock scan cycle as soon as all blocking owners
have finished processing their ASTs (this applies to CS too: a special signal
is sent to the blocked owner so that it notices when the last blocking owner
has finished AST processing).
3. DPM/VIO code has a lot of places where deadlocks are possible.
Many places in DPM rely on fixed 1-second timeouts to detect and process
deadlocks. The problem is that deadlocks are now reported before the
1-second timeout expires, so the page-locking retry loop spins too quickly.
If I add a manual 1-second timeout to those loops, everything will work as
it did before.
4. There is a bug: if a locking problem happens during transaction-level
savepoint backout during rollback, the engine reports BUGCHECK(290) instead of
declaring the transaction DEAD and continuing to work.
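
To make the difference between points 1 and 2 concrete, here is a minimal sketch of the two conditions. It is not the lock manager code: the Owner struct and both function names are hypothetical, and only the "signaled" idea is borrowed from the OWN_signaled flag mentioned above.

```cpp
#include <vector>

// Hypothetical, simplified owner record (an assumption for illustration,
// not the real lock manager structure).
struct Owner {
    bool signaled;      // a blocking AST has been delivered (cf. OWN_signaled)
    bool ast_finished;  // the owner has finished processing that AST
};

// Point 1: a deadlock may only be considered once every blocking owner
// has been signaled.
bool all_blockers_signaled(const std::vector<Owner>& blockers) {
    for (const Owner& o : blockers)
        if (!o.signaled)
            return false;
    return true;
}

// Point 2: the deadlock scan may start as soon as every blocking owner
// has also finished processing its AST, without waiting out a timeout.
bool can_start_deadlock_scan(const std::vector<Owner>& blockers) {
    for (const Owner& o : blockers)
        if (!o.signaled || !o.ast_finished)
            return false;
    return true;
}
```

The point of the second predicate is that once it holds, no pending AST can still downgrade a lock, so waiting longer does not change the outcome.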

> And since the physical access paths are "known" to be deadlock free, there is
> no code to recover from transient deadlocks.

Both parts of this sentence are wrong :(
Look at the dpm.cpp functions DPM_delete, DPM_fetch_back, extend_relation and
mark_full: almost any function that does page-level access has code to
handle the deadlocks it can cause.
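
As a rough illustration of that kind of handling, combined with the manual 1-second pause suggested in point 3, a retry loop could look like the sketch below. This is not the actual dpm.cpp code: LockResult, lock_with_retry and the try_lock callback are hypothetical names used only for the example.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical outcome of a single page-lock attempt.
enum class LockResult { Granted, Deadlock };

// Retry a page-lock attempt, pausing after each reported deadlock so the
// loop does not spin when deadlocks are reported instantly.
// Returns true if the lock was granted within max_attempts tries.
bool lock_with_retry(const std::function<LockResult()>& try_lock,
                     int max_attempts,
                     std::chrono::milliseconds pause = std::chrono::milliseconds(1000))
{
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (try_lock() == LockResult::Granted)
            return true;
        // Deadlock reported: give the other owner time to release the page
        // before retrying (skip the pause after the final attempt).
        if (attempt + 1 < max_attempts)
            std::this_thread::sleep_for(pause);
    }
    return false;
}
```

With the default pause the loop behaves much like the old timeout-driven code: each failed attempt costs about one second instead of retrying immediately.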

> Superserver is a different animal. But how different is restrained by an
> unnatural common code base with classic.

There is no big problem with that. Oracle also has a common codebase
for its Classic server and Multi-Threaded engine, and they have no
problems with it.

> Jim Starkey

Nickolay Samofatov