Subject Re: Understanding internals of sweep process - why saves are bigger than db size?
Author Dmitry Yemanov
29.12.2012 23:31, karolbieniaszewski wrote:
> i have problem with understanding internal work of sweep
> my db size is 52.74GB i detach from db restart fb server and run sweep
> by gfix -sweep
> now i see in task manager that Firebird write to disk 1154 GB
> this is 21 times bigger than the db size itself!
> - db is not corrupted - it is restored from backup and then i deleted 25%
> of the data from it.
> sweep process is still running.
> here are details from monitoring tables and task manager

What we see here:

1) 71M pages have been read into the cache by the sweeper. It corresponds
to 1.1TB. Surely it's too much for a 53GB database. It may mean that
many pages were accessed again and again in more or less random
order. Looks like complex cross-record dependencies (many fragmented
records or long version chains).

2) 300K pages have been written by the sweeper itself. It corresponds to
4.8GB. Quite fine here.

3) 69M pages have been written by the engine background threads. Namely,
they are: garbage collector and cache writer. Provided that we didn't
break anything in the logic recently, background GC is expected to be
disabled for the sweeper, as it works synchronously. So it should be the
cache writer who did all the page writes. Again, it looks too much for a
53GB database, but knowing how many reads were performed, it doesn't
look extreme anymore, as any page being read into a cache full of dirty
pages means a page write.
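As a quick sanity check, the byte volumes above can be reproduced from the page counts. A minimal sketch, assuming a 16KB page size (the page size is not stated in the thread, and the counts are rounded, so the results only approximately match the figures quoted):

```python
# Reproduce the quoted I/O volumes from the page counts.
# Assumption: 16 KB page size (not stated in the thread); counts are rounded.
PAGE_SIZE = 16 * 1024  # bytes

def gib(pages):
    """Convert a page count into gibibytes."""
    return pages * PAGE_SIZE / 1024**3

print(f"sweeper reads:      {gib(71_000_000):7.0f} GB  (~1.1 TB)")
print(f"sweeper writes:     {gib(300_000):7.1f} GB")
print(f"background writes:  {gib(69_000_000):7.0f} GB")
```

With these assumptions the reads come out around 1083 GB and the background writes around 1053 GB, i.e. both in the ~1.1TB range, while the sweeper's own writes land near the ~5GB figure above.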

So the actual problem is in (1) and, at first glance, it's caused by
either many fragment chains or many version chains (but we know that
only 25% of records have versions -- delete stubs).

I have a guess what could be happening here. We use a WIN_large_scan
flag for the sweeper (as well as for regular full table scans) which
places the last released buffer first in line for reuse (as a measure
against the well-known cache pollution issue). This works fine if we
read records sequentially from the same page and this is what the
sweeper normally does. But any jump to another page (following a
fragment pointer or a back-version pointer) will immediately evict our
last data page from the cache (since the cache is full of dirty pages),
and it will have to be read from disk again once we've processed the
fragments / back versions and continue the cleanup with the next record.
The more fragmented the records are, the closer the I/O numbers will get
to the record count (maybe even with a 2x factor) instead of the data
page count.
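To make that mechanism concrete, here is a toy model of the reuse policy described above. All names, structure, and numbers are invented for illustration; this is not Firebird's actual buffer-manager code. Releasing a buffer in "large scan" mode queues it for immediate reuse, so a single jump to a back-version page between two records on the same data page forces that data page to be re-read:

```python
from collections import OrderedDict

class Cache:
    """Toy page cache. Normal releases behave like classic LRU;
    'large scan' releases queue the buffer for immediate reuse."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # front = next eviction victim
        self.reads = 0              # physical reads from "disk"

    def fetch(self, page, large_scan=False):
        if page in self.pages:
            self.pages.pop(page)    # cache hit: re-position below
        else:
            self.reads += 1         # cache miss: read from disk
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict the victim
        self.pages[page] = None     # normal release: protected (back)
        if large_scan:
            # Large-scan-style release: first in line for reuse.
            self.pages.move_to_end(page, last=False)

def sweep(cache, n_records, recs_per_page, fragmented):
    """Walk all records in storage order, as a sweep would."""
    for rec in range(n_records):
        cache.fetch(('data', rec // recs_per_page), large_scan=True)
        if fragmented:
            # Jump to a back version / fragment on some other page;
            # this evicts the data page we were just scanning.
            cache.fetch(('version', rec))

clean = Cache(64)
sweep(clean, 10_000, 100, fragmented=False)
print(clean.reads)   # 100: one read per data page

messy = Cache(64)
sweep(messy, 10_000, 100, fragmented=True)
print(messy.reads)   # ~20,000: close to 2x the record count
```

In the sequential case the policy does exactly what it was designed for: the scan reuses its own buffers and reads each data page once. In the fragmented case every record costs roughly two physical reads, which matches the "I/O close to record count, maybe with a 2x factor" estimate above.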