Subject Re[3]: [Firebird-Architect] FW: Deadlocks on FB1.5
Author Nickolay Samofatov
Hello, Jim,

> At 07:58 PM 5/14/03 +0400, Nickolay Samofatov wrote:


>>3. DPM/VIO code has a lot of places where deadlocks are possible.
>>Many places in DPM rely on fixed 1-second timeouts to detect and process
>>deadlocks. The problem is that deadlocks are now reported before the
>>1-second timeout expires, so the page-locking retry loop spins too quickly.
>>If I add a manual 1-second timeout to those loops, everything works as
>>it did before.


> Interesting. Didn't used to work that way. It was originally designed to be
> logically deadlock free, meaning a) any physical deadlock was a serious
> logical error that would almost certainly result in corrupted databases,
> and b) deadlock recovery code was not only not needed but couldn't be
> tested.

> A bona fide deadlock in the physical access paths means the database
> isn't careful write. And if it's not careful write, it's not guaranteed valid
> after a system crash.

I still do not have a perfect understanding of DPM operation, but what
I see is:

I. The engine dies (or slows to a crawl) under contention inside the
deadlock-detection loops. The reason for this behaviour is the following
logic (common in DPM code):

1. get lock 1
2. try to get lock 2 with a timeout of 1 second
3. if the timeout expires, release lock 1 and go to step 1
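
For readers who prefer code, the three steps can be sketched roughly as
below. This is only an illustration in Python, not the DPM source: the
function name acquire_pair and the max_attempts parameter are hypothetical,
and plain threading.Lock objects stand in for the engine's page locks.

```python
import threading

def acquire_pair(lock1, lock2, timeout=1.0, max_attempts=None):
    """Take lock1, then lock2, backing off when lock2 is not granted in time.

    Returns the number of attempts used, or None if max_attempts ran out.
    On success the caller holds both locks and must release them.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        lock1.acquire()                     # step 1: get lock 1
        if lock2.acquire(timeout=timeout):  # step 2: try lock 2 with a timeout
            return attempts
        lock1.release()                     # step 3: timed out, drop lock 1, retry
    return None
```

Note that lock 1 is held for the whole timeout in step 2, so other workers
that need it can only get in during the brief gap between the release in
step 3 and the next acquire in step 1.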

As you can see, under contention the engine can do useful work only in
the short window between step 3 and step 1, when lock 1 is released.
The worst case is that under Windows XP another process usually doesn't
notice a mutex that flips for only a short period of time, and the
engine halts completely.

II. In Firebird 1.5 RC1 I changed the deadlock detection logic to detect
deadlocks instantly, and the loops mentioned above began to spin quickly.
This increased pressure on other parts of the engine, and multiple real
internal deadlocks appeared all over the place. Examples: deadlocks in the
index bucket recombination code (rare), or a deadlock in locate_space
(frequent) when another process does mark_full or extend_relation.
locate_space locks the data page first and then the pointer page, while
mark_full and extend_relation lock the pointer page first and only then
the data page, and both places do that without any timeouts.
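
To make the ordering problem concrete, here is a small sketch that compares
the lock acquisition order of each code path and reports the pairs taken in
opposite order. It is an illustration only: the helper names are
hypothetical, and treating "data_page" and "pointer_page" as two single lock
classes is a simplifying assumption, not how the engine tracks page locks.

```python
from itertools import combinations

def ordered_pairs(sequence):
    """All (earlier, later) lock pairs implied by one acquisition sequence."""
    return {(a, b) for i, a in enumerate(sequence) for b in sequence[i + 1:]}

def find_order_conflicts(paths):
    """Return (path_a, path_b, first, second) for every pair of code paths
    where path_a takes `first` before `second` and path_b does the reverse."""
    conflicts = []
    for (name_a, seq_a), (name_b, seq_b) in combinations(paths.items(), 2):
        pairs_b = ordered_pairs(seq_b)
        for first, second in sorted(ordered_pairs(seq_a)):
            if (second, first) in pairs_b:
                conflicts.append((name_a, name_b, first, second))
    return conflicts

# Acquisition orders as described in this message (simplified).
paths = {
    "locate_space": ["data_page", "pointer_page"],
    "mark_full": ["pointer_page", "data_page"],
    "extend_relation": ["pointer_page", "data_page"],
}
conflicts = find_order_conflicts(paths)
```

With these inputs, locate_space conflicts with both mark_full and
extend_relation, while mark_full and extend_relation agree with each other.
Two paths that take the same locks in opposite order and wait without
timeouts can block each other forever.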

Could you explain the original concept to me? What is the secret weapon
against such problems?

What I'm doing now is identifying the places where real deadlocks are
reported and fixing them by adding retry spin loops.
But I don't think this was your original concept.

> Jim Starkey

Nickolay Samofatov