Subject: Problems with FB on W2K vs. IB on NT
Author: Wilson, Fred
Below is a semi-long, somewhat rambling description of what we're seeing
and what we believe is happening.
I wrote the first part; the second part was written by one of the other
developers.

We're not 100% sure of any of this yet. I've added a few extra notes at the
end of the first section, covering other things we observed last Friday,
after the first part was written.




1st part:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some observations regarding Firebird (1.0) vs. IB (5.6).
We've run into some rather serious roadblocks in our attempt to convert
from IB5.6 to FB (currently 1.0).
We've been running IB5.6, set up as described below, for a couple of years
at a lot of sites with no real problems.
In our in-house testing with FB1.0, running primarily on W2K, we've run
into problems.

- Our server platform is W2K, and NT4.0.
- The database servers are configured with Forced Writes Off.
- The clients are either NT4.0 or W2K.
- TCP/IP is the network protocol.

We've spent some time looking at the Firebird source code in an attempt to
validate our understanding of what we're seeing.
We've also spent a lot of time (maybe a man-month or more) looking at
everything we could think of on the database servers.
We've watched things like thread counts, thread priorities, thread states,
lazy-write calls, disk writes, thread state changes, and on and on.
We *think* we're starting to understand what we're seeing as the
"problem".
Let me try to describe the "problem".
We have extremely intensive database-access software clients. Many of these
clients process "data" at the rate of 5 accounts per second.
Each account consists of several database accesses: a "read", followed
later by an "insert" and an "update".
The class of the server machine probably isn't important at the moment;
all a more powerful machine (faster CPU, faster RAID access, more RAM,
etc.) does is allow more clients to run before it "breaks".
Now, the problem we see is that as the server machine (the FB server) gets
busy, as determined by CPU usage, the server starts using more and more
memory (W2K).
At some point the available memory goes way, way down (sometimes to less
than 10 megs), and, of course, everything crawls.
By looking at the code, the threads, the thread states, etc., on the
server, we've come to the following conclusion.
BTW, we don't see this when running IB5.6 on *exactly* the same hardware,
running *exactly* the same number of applications and SQL scripts.
For these tests we intentionally loaded the test servers up to 100% CPU
usage and then began collecting stats.

It appears, from the stats and from the FB source code, that FB has a
separate thread for GC. This certainly isn't new; we've been aware of it.
It appears that the GC thread runs at the same priority as all the other
threads.
On a system that's taxed (we'll be running some further tests to try to
determine where that threshold is), the GC thread, since it's given the
same amount of time (time slice) by W2K, simply cannot keep up with the
amount of work (garbage collection) it has to do, so the amount of work
queued for it continues to increase.
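A back-of-the-envelope model shows why an equal-priority GC thread falls
behind (the numbers below are invented; only the shape matters):

    # Toy model, not Firebird code. With equal time slices, one GC thread
    # among N busy workers gets roughly 1/(N+1) of the CPU, so once the
    # workers create garbage faster than that share can clean it, the
    # backlog only grows.
    workers = 8
    produce_rate = 10.0     # spent record versions per worker per second
    collect_rate = 60.0     # versions a full CPU-second of GC can clean
    gc_share = 1.0 / (workers + 1)

    backlog = 0.0
    for t in range(1, 11):
        backlog += workers * produce_rate
        backlog -= min(backlog, collect_rate * gc_share)
        print("t=%2ds  backlog=%6.0f record versions" % (t, backlog))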
We see memory usage climb and climb and climb.
At some point the OS basically halts everything, goes through its own
internal memory "cleanup", and then some of the memory is again available
for use.
On a taxed server the database size grows and grows (compared to IB5.6).
When all the applications are shut down (they've been running, but just
barely, getting almost no CPU time), the GC thread finally has room to
work, and work it does. Running the server overnight (8-10 hours) at 100%
usage builds up so much work for the GC thread that, after we shut down all
the applications and stop all the running scripts, on these
lower-powered-than-normal test servers the GC thread will sometimes run for
several *hours*, with CPU usage on the server pegged at 100%.
With IB5.6, we don't see any of this happening.
Memory usage on the server will go up a little (maybe 20-30 megs), then
come back down, over and over; each up-and-down cycle takes a minute or
so.
We can run the same test, on the same hardware, with the same database and
the same OS, for many more hours with IB5.6, and it still has no problems.
Memory usage on the server is fine; the threads are stressed, for sure, as
shown by the thread queue, but there are no real problems.
What we believe is happening (we don't have the IB5.6 source to confirm
this) is that when a thread (transaction) modifies a table, the next thread
that comes along and "touches" that table checks whether there is any
garbage to collect, and if so, *it* does the collecting right then.
This means the GC never really gets left behind. It stays pretty current,
with no big queue of work, because each thread cleans up each table as it
touches it.
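In sketch form, our guess at that cooperative scheme looks something like
this (purely illustrative; release() stands in for whatever actually
reclaims a spent record version):

    import collections, threading

    pending = collections.defaultdict(list)  # table -> spent record versions
    lock = threading.Lock()

    def on_touch(table, do_work, release):
        # Cooperative cleanup: whichever worker touches the table next pays
        # for the collection on its own time slice, so total cleanup
        # capacity scales with the number of busy workers.
        with lock:
            garbage = pending.pop(table, [])
        for version in garbage:
            release(version)
        do_work()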
The problem occurs when the threads get into a "starved" state, which can
be roughly defined as the point where the number of threads in the thread
wait queue stays much above 1.
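For reference, this is roughly how we watch for that state (a minimal
sketch using pywin32's win32pdh wrapper over the standard Windows
performance counters; the five-second interval is arbitrary):

    import time
    import win32pdh  # pywin32 wrapper around the Windows PDH counters

    paths = [
        r"\Processor(_Total)\% Processor Time",
        r"\System\Processor Queue Length",  # our rough "starved" indicator
    ]
    query = win32pdh.OpenQuery()
    counters = [win32pdh.AddCounter(query, p) for p in paths]

    win32pdh.CollectQueryData(query)  # prime the rate-based counter
    while True:
        time.sleep(5)
        win32pdh.CollectQueryData(query)
        cpu, queue = [win32pdh.GetFormattedCounterValue(
                          c, win32pdh.PDH_FMT_LONG)[1] for c in counters]
        print("cpu=%3d%%  queue=%d" % (cpu, queue))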

ADDITIONAL NOTES:
We've noticed that, upon test startup, NT4.0 does a lot more thread
context switching than W2K, and while any individual thread on W2K seems to
get more CPU because of the reduced context switching, NT4.0 actually seems
to work "better", perhaps because more threads are given CPU time over any
given period.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



2nd part:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
it might just be me, but my attitude toward the post is more akin to a
sterile, non-committal approach - which would still hopefully generate
interest from the right fb deity. i think this stems from the fact that
although it 'seems' clear that something is amiss, we really don't know
what the root cause is (it may be a combination of background gc,
Win2000/NT differences, our apps, the test environment, the actual test
itself, moon phase, etc.). maybe we're just whacked, and our simple post
will be answered and resolved. but if it generates valid activity/inquiry,
we can easily add detailed information/data/opinion.

here goes:
although i haven't come across any ng's, docs, sites, faqs, etc.,
explaining this issue, i could easily have missed it and the answer (so
forgive me if this is a nb-like post).

a lab test observation -
in trying to troubleshoot a software release, we've set up both fb (1.0xxx)
and ib (5.6xxx) on four identical laboratory test platforms (Win2000, blah,
blah, blah). the test is designed to marginally overdrive the db server cpu
resources (100% cpu, processor queue length of approximately 5) for a
sustained period of approximately eight hours. during prior and current
tests of ib, the clients and database complete the test successfully. when
running the same validation against the fb variation, clients slowly become
unresponsive over time while the db server cpu activity remains constant.
even when these clients are shut down at the end of the test, db server cpu
activity still remains high.
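for what it's worth, the harness is shaped roughly like this (a
hypothetical sketch - run_client.exe stands in for whatever launches one
client, and the client count is something we tune by watching the
counters):

    import subprocess, time

    CLIENTS_FOR_OVERDRIVE = 12  # made-up number; tuned until the server
                                # sits at ~100% cpu with a queue of ~5

    clients = []
    try:
        for _ in range(CLIENTS_FOR_OVERDRIVE):
            clients.append(subprocess.Popen(["run_client.exe"]))
            time.sleep(30)        # let the load settle between adds
        time.sleep(8 * 3600)      # hold the overdrive for ~eight hours
    finally:
        for c in clients:
            c.terminate()         # staged shutdown frees cpu for gc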

during an attempt to narrow down the cause, fb server activity was
monitored during a test using typical rdk/sdk/ddk windows instrumenting
tools. data gathered from the individual fb server threads suggests that
when the fb server is under full load, the gc thread becomes starved. over
time, this may have contributed to a build-up of spent data, possibly
leading to the client unresponsiveness.

while observing the shutdown of the individual test clients, we saw that
for each client removed, the released cpu bandwidth was applied to the gc
thread. after all but one inactive client was shut down, the fb gc thread
continued to utilize the cpu at near 100% for approximately 20 more minutes
(after which cpu load dropped back to normal levels).

for the fb gurus -
could starving an fbserver process for an extended period cause the gc
thread to 'back up', resulting in the behaviour documented above? if not,
or if the real culprit is documented elsewhere, could you point me in the
right direction?

oh, and if there is any additional information needed, please feel free to
ask . . .

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Best regards,
Fred Wilson
SE, Böwe Bell & Howell
fred.wilson@...


