Subject Re: [IB-Architect] Insert Speed
Author Jan Mikkelsen
Jim Starkey <jas@...> wrote:

>A PIP (page inventory page) is indeed a bitmap of free vs. used
>pages. Each pip keeps a low water mark to avoid repetitive
>searching. It could be made smarter at the cost of density, but I
>don't think there is a bottleneck here.


A low water mark for searching is just fine, given that every allocation is
just for a single page. If the PIP has a free page, the low water mark
points straight to the free page, otherwise the low water mark says "full".
I don't think it needs to be made denser on disk, but if allocation is done
more than a page at a time, something more complex is needed to find a
sequence of n free pages.
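
Something along these lines is what I have in mind for a multi-page search;
the bitmap layout and the pages-per-PIP constant are my own guesses, not the
actual structures:

#include <stddef.h>

#define PIP_BITS_PER_PAGE 8192   /* assumed: pages tracked per PIP */

static int pip_bit_is_used(const unsigned char *bits, size_t i)
{
    return (bits[i / 8] >> (i % 8)) & 1;
}

/* Return the first index of a run of n free pages, or -1 if none. */
long pip_find_free_run(const unsigned char *bits, size_t low_water, size_t n)
{
    size_t run = 0;

    for (size_t i = low_water; i < PIP_BITS_PER_PAGE; i++) {
        if (pip_bit_is_used(bits, i))
            run = 0;                 /* run broken, start over */
        else if (++run == n)
            return (long)(i - n + 1);
    }
    return -1;                       /* no run of n free pages in this PIP */
}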

>>How are ACID properties maintained for page allocation within the database
>>file? Is there a mini log of some sort for maintaining the integrity of the
>>database file? If an in-core copy of the structure is used, of course it
>>will need to be committed to disk at appropriate points. The reason for
>>this question is that (I assume) when using page by page allocation, at
>>least one write is necessary for each successive page allocation to ensure
>>the consistency of the underlying database file. Allocation by extent would
>>(I expect) reduce the number of writes necessary.
>>
>
>Could you explain ACID?


Atomicity, Consistency, Isolation, Durability.

>The cache maintains precedence relationships to ensure that the PIP
>is written before any of the pages allocated from it, but it is
>far from a 1:1.


So write ordering is the answer. Presumably the relevant pointer page for
the table would also need to be written before the page itself. I also
would guess that the order of the writes must be PIP, pointer page, data
page.
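
To make my assumption about "precedence relationships" concrete, I imagine
something like the sketch below: each dirty buffer records the buffers that
must hit disk before it does, and a flush writes those first. The names are
entirely hypothetical; I have no idea how close this is to the real cache.

#include <stddef.h>

struct buffer {
    int            page_number;
    int            dirty;
    struct buffer *must_write_first[4];   /* buffers that must reach disk first */
    size_t         n_deps;
};

void write_page_to_disk(struct buffer *b);   /* assumed to exist elsewhere */

void flush_buffer(struct buffer *b)
{
    size_t i;

    if (!b->dirty)
        return;
    for (i = 0; i < b->n_deps; i++)
        flush_buffer(b->must_write_first[i]);   /* e.g. PIP before data page */
    write_page_to_disk(b);
    b->dirty = 0;
}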

What happens if the PIP is written, but a system crash stops the pointer
page from being written? Is the page lost to future allocations, or does
the database startup procedure check for lost pages?

What are the recovery procedures to check database file integrity on
startup?

On a slightly tangential issue: I assume that Interbase opens files with
O_SYNC on Unix, or FILE_FLAG_WRITE_THROUGH on Win32.
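
In other words, I'm assuming the equivalent of this on the Unix side, so that
each write reaches stable storage before the call returns (error handling
omitted; this is my assumption, not a statement about the actual code):

#include <fcntl.h>

int open_database_rw(const char *path)
{
    /* O_SYNC makes each write() synchronous with the underlying device. */
    return open(path, O_RDWR | O_SYNC);
}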

>I did a test about a lifetime back of the performance gain
>by ignoring careful write altogether. The difference was in
>the order of a couple of percentage points -- not enough to
>get excited about.


By careful write, I assume you mean ordered writes to ensure integrity. The
risks of abandoning ordered writes in this context must be pretty severe.

(Incidentally, I have also heard the term "careful write" applied to
mirrored disks: Ensuring that one side of the mirror is successfully
written before starting the second write, so that you either get the old or
the new page contents after a system crash, rather than two copies of a
compromised page. An expensive technique for the totally paranoid. Just a
minor terminology difference.)

>>I don't have a very clear view of the semantics or structure of the
>>database file.
>
>That's a big question, sir. The general page types are:
>
> Header. One per database
> Page Inventory Page
> Transaction Inventory Page
> Pointer Page (points to data pages in table)
> Data page
> Index root page (one per table)
> Index page
> Generator pages
>
>Pick your favorite and ask.


Oh, I don't like to pick favourites. I'll ask about them all. :-)

I understand PIPs, TIPs and data pages. Pointer pages come up later in this
message.

I assume index pages are pretty much classic b-tree indexes. Interior nodes
ultimately pointing to leaf pages containing an ordered list of key value
and record identifier pairs. The leaf pages are probably more complex than
the classic structure to handle isolation and recovery.

You mentioned that indexes store a record number. I assume this is really a
classic record ID structure, something like a 24 bit page number and an 8
bit index number for the record on the page represented as a 32 bit value.
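
In other words, something like this, where the 24/8 split is purely my
assumption from the description above:

#include <stdint.h>

typedef uint32_t rec_id;   /* assumed: 24 bit page number, 8 bit slot */

static rec_id make_rec_id(uint32_t page, uint32_t slot)
{
    return (page << 8) | (slot & 0xFFu);
}

static uint32_t rec_id_page(rec_id id) { return id >> 8; }
static uint32_t rec_id_slot(rec_id id) { return id & 0xFFu; }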

The index root page presumably has pointers to the root pages of the various
indexes on a table. How are indexes named on this page? Is it just an
integer, or does it actually contain an index description?

Generator pages seem pretty clear (and not that interesting). Presumably a
set of <generator_id, generator_value> pairs.

How are blobs stored? I assume they are also stored a page at a time. Is it
just a linked list of pages, or is random access possible? If random access
is possible, why isn't that exposed in the API?

That leaves the header page. Presumably values to allow bootstrapping, page
size, version number, pointers to system tables which describe where the
tables live, etc.

What have I missed?

>>On the allocation of pages from the OS to the database file, requesting
>>pages in larger chunks from the operating system can be faster, as well as
>>helping the underlying filesystem reduce fragmentation. On the level of the
>>database file, I don't think allocating in (say) 64KB or even 256KB chunks
>>is a big issue.
>
>On VMS (ugh, oh double ugh) we do a pretty much standard extension
>algorithm (double the size up to a limit). I didn't know that
>page allocation was a Unix concept? My understanding was that
>Unix allocated pages on demand and that seeking to an arbitrary
>page and writing would allocate that page, leaving holes in the
>unreferenced space. Never looked, though.


The Unix API has no concept of page allocation, or of sparse files, for that
matter. Unix semantics guarantee that if you seek past the end of file and
write something there, reads of the data between the old end of file and the
written data will return zeros. It is the filesystem implementation which
decides whether the file is sparsely allocated. Traditional Unix filesystem
implementations have not allocated pages for the gap and behave pretty much
as you expect, although that is not required to provide the Unix API
semantics. The Veritas VxFS file system, for example, is an extent based
filesystem for Unix. The application can control extent sizes and
allocation policy.
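
The guarantee I am describing is easy to demonstrate: seek past end of file,
write a block, and reads of the gap return zeros, whether or not the
filesystem actually allocated blocks for it. (A toy illustration, nothing
more.)

#include <fcntl.h>
#include <unistd.h>

int make_possibly_sparse_file(const char *path)
{
    char block[8192] = { 'x' };
    int  fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0)
        return -1;

    /* Leave a 10 MB gap; a traditional Unix filesystem need not allocate
       blocks for it, but reads of the gap must return zeros. */
    lseek(fd, 10 * 1024 * 1024, SEEK_SET);
    write(fd, block, sizeof block);
    return close(fd);
}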

Windows NT makes this distinction at the API level. By default, files are
not expected to be sparse, so extending a file will allocate pages. What
allocation actually does will vary according to the underlying file system.
NTFS will allocate pages in runs (like extents). Allocating more space at
once should help NTFS create longer runs.

To invoke sparse allocation, the underlying filesystem must support sparse
allocation, and the application must explicitly request sparse allocation for
that file using a DeviceIoControl call. Sparse allocation is also possible
with file compression, although the semantics are different.
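
The explicit request I mean looks roughly like this (NTFS and a sufficiently
recent NT are assumed; again, just a sketch):

#include <windows.h>

BOOL mark_file_sparse(HANDLE hFile)
{
    DWORD bytesReturned = 0;

    /* After this, regions that are never written need not have
       clusters allocated for them. */
    return DeviceIoControl(hFile, FSCTL_SET_SPARSE,
                           NULL, 0, NULL, 0, &bytesReturned, NULL);
}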

Extension of an operating system file will typically require at least one
write, unless you are using a filesystem which doesn't provide metadata
guarantees and requires an fsck-style operation on restart after a crash.
Such filesystems are best avoided.

Incidentally, both Windows NT and VxFS allow operations on files to bypass
the buffer cache, which would be nifty in a superserver implementation.
Does the NT version of IB use FILE_FLAG_NO_BUFFERING?
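
For reference, bypassing the cache on NT looks something like the following;
the catch is that all I/O must then be done in sector-aligned, sector-multiple
chunks (whether IB does this I obviously don't know):

#include <windows.h>

HANDLE open_unbuffered(const char *path)
{
    return CreateFileA(path, GENERIC_READ | GENERIC_WRITE,
                       0,                     /* no sharing */
                       NULL, OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       NULL);
}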

>>On the allocation of pages to tables (the subject of my question), I don't
>>know whether or not there is a win in larger allocations, although I expect
>>there is. Is fragmentation within large databases a big issue?
>>
>>David Schnepper has posted that he recalls the bottleneck was searching for
>>pages within the table with enough free space to take the new record. What
>>does the structure for recording usage within a table look like? How
>>expensive is the maintenance?
>
>Pointer pages are allocated to tables as needed (released only
>when table is deleted). Data pages have the classical database
>structure: a line index at one end holding offset on page and
>length of each record segement. The engine remembers where it
>stored the last record and starts looking for space there. The
>pointer page also keep track whether a data page is known to be
>full. Trying to keep more accurate statistics would be extremely
>expensive in classic.

The pointer page seems to serve two purposes. Presumably it stores database
pages allocated to the table (numbered relative to the database, not the
table), and whether the page is full or not. Conceptually the list of
database pages allocated to the table should live outside the table itself,
but whether a page is full or not should live inside the table.

I guess this all means that to check the overall database file consistency,
you have to have some knowledge of table structure. Because allocation is
page at a time, every database page allocated to a table needs to be in a
pointer page.

How are pointer pages related? Are they in a linked list, or is a list of
pointer pages for a table stored in a special pointer page?

What about pages allocated to indexes? Is there a variant of a pointer page
to see which pages are allocated to an index?

The summary is that finding a page for a large record in a big table can
actually be very expensive.

An alternative would be separating the allocation of pages to objects from
internal object bookkeeping. Using extent-based allocation you can avoid
storing each page allocated to an object by number. By separating the
internal bookkeeping, the information about table page usage becomes much
denser. Two bits per database page could store, say, empty, <50% used,
>50% used, full.
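
As a sketch of how dense that bookkeeping could be (entirely hypothetical;
nothing like this need exist in the engine):

#include <stdint.h>

enum fill_state { PG_EMPTY = 0, PG_LT_HALF = 1, PG_GE_HALF = 2, PG_FULL = 3 };

/* Two bits per data page: a 4096 byte map covers 16384 pages. */
static void set_fill(uint8_t *map, uint32_t page, enum fill_state s)
{
    uint32_t byte = page / 4, shift = (page % 4) * 2;

    map[byte] = (uint8_t)((map[byte] & ~(0x3u << shift)) |
                          ((uint32_t)s << shift));
}

static enum fill_state get_fill(const uint8_t *map, uint32_t page)
{
    return (enum fill_state)((map[page / 4] >> ((page % 4) * 2)) & 0x3u);
}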

Of course, my guesses about the pointer page structure could be totally
wrong, and something like this (or better) might exist.

> I'd like to see classic retire to a
>tropical island somewhere, but Charlie keeps coming up with
>reasons to keep it.


I can see lots of good reasons to drop the classic architecture. Contention
during table extension without a shared cache must be pretty ugly.

What I would like (and will do myself when I get the source if necessary) is
a linkable Interbase so a single user application can run entirely in a
single process. I'd be just as happy to link a superserver though.

Jan Mikkelsen
janm@...