Subject: Re: [IB-Architect] Insert Speed
Author: Jim Starkey
At 11:23 AM 4/9/00 +1000, Jan Mikkelsen wrote:
>
>A low water mark for searching is just fine, given that every allocation is
>just for a single page. If the PIP has a free page, the low water mark
>points straight to the free page, otherwise the low water mark says "full".
>I don't think it needs to be made denser on disk, but if allocation is done
>more than a page at a time, something more complex is needed to find a
>sequence of n free pages.
>

All allocations are a page at a time.

>So write ordering is the answer. Presumably the relevant pointer page for
>the table would also need to be written before the page itself. I also
>would guess that the order of the writes must be PIP, pointer page, data
>page.
>

And those would be bad guesses. The overriding principle behind
careful write is that an object is always written before a pointer
to the object, so the worst case after a crash is lost space. Assume a
transaction that inserted a single record that required allocation
of an additional data page. Here are the precedence relationships:

1. The data page must be written before the pointer page.
2. The PIP must be written before the data page.
3. The data page, the pointer page, and the PIP must all be
written before the Transaction Inventory Page (TIP) indicating
the transaction has committed.
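
(To make the ordering concrete, here is a rough sketch of one write
order that satisfies all three precedences. The page type and
write_page() are hypothetical names for illustration, not engine code;
the real engine tracks these dependencies on its page buffers rather
than hard-coding an order.)

typedef struct page { long number; char body[4096]; } page_t;

extern void write_page(page_t *page);   /* assumed synchronous page write */

void commit_insert_needing_new_page(page_t *pip, page_t *data_page,
                                    page_t *pointer_page, page_t *tip)
{
    write_page(pip);            /* rule 2: PIP before the data page        */
    write_page(data_page);      /* rule 1: object before pointer to object */
    write_page(pointer_page);
    write_page(tip);            /* rule 3: everything before the TIP       */
}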


>What happens if the PIP is written, but a system crash stops the pointer
>page from being written? Is the page lost to future allocations, or does
>the database startup procedure check for lost pages?
>

The database validation function will find lost pages. There isn't
much to be gained by starting an extremely expensive validation
pass at startup time and much to be lost. A few lost pages don't
have much effect on the world.

>What are the recovery procedures to check database file integrity on
>startup?
>

By design, none are needed.

>On a slightly tangential issue: I assume that Interbase opens files with
>O_SYNC on Unix, or FILE_FLAG_WRITE_THROUGH on Win32.
>

I'm not sure of the current state. It used to be an option. When
it first showed up on SunOS it was such an infinite dog that making
it mandatory seemed like cruel and unusual punishment for users.

Interestingly enough, given the pattern of writes (generally forced
by a commit verb), the Unix cache rarely screws things up and
provides a beneficial second-level cache.

>
>By careful write, I assume you mean ordered writes to ensure integrity. The
>risks of abandoning ordered writes in this context must be pretty severe.
>

Merely catastrophic. I just wanted to see the degree of overhead
so I could find something else to worry about. I don't believe in
thinking about performance; I want numbers. I prefer to attack real
hot spots.

>(Incidentally, I have also heard the term "careful write" applied to
>mirrored disks: Ensuring that one side of the mirror is successfully
>written before starting the second write, so that you either get the old or
>the new page contents after a system crash, rather than two copies of a
>compromised page. An expensive technique for the totally paranoid. Just a
>minor terminology difference.)
>

The term 'careful write' was in use long before the idea of mirrored
disks popped into somebody's head. The file systems used on DEC's
RSX and RSX+2 (aka VMS) were the first use that I am aware of.

>
>I understand PIPs, TIPs and data pages. Pointer pages are later in this
>message.
>
>I assume index pages are pretty much classic b-tree indexes. Interior nodes
>ultimately pointing to leaf pages containing an ordered list of key value
>and record identifier pairs. The leaf pages are probably more complex than
>the classic structure to handle isolation and recovery.
>

Ah, sir, wrong again! They are less complex because they don't worry
about rigid consistency between index and records, for a number of
reasons. First, the multigenerational architecture requires that the
index (transiently, admittedly) reflect multiple keys per logical
record. Second, there is no hope of synchronizing index pages with
data and pointer pages. Third, the index is imprecise -- all numbers
and dates are converted to double precision floating point during the
key generation process. The easy solution is for the engine to treat
the indexes as noisy. Before it believes the index, it fetches the
appropriate version of the record and applies the appropriate predicate
to make sure it is a keeper.
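
(A rough sketch of that "noisy index" discipline, with hypothetical
names -- fetch_record_version() and predicate_matches() are mine, not
the engine's:)

typedef struct record record_t;
typedef struct transaction transaction_t;
typedef struct predicate predicate_t;

/* Assumed helpers: fetch the record version visible to the transaction,
   and evaluate the query predicate against the real record data. */
extern record_t *fetch_record_version(long record_number, transaction_t *tra);
extern int predicate_matches(const predicate_t *predicate, const record_t *record);

int index_hit_is_keeper(long record_number, transaction_t *tra,
                        const predicate_t *predicate)
{
    record_t *record = fetch_record_version(record_number, tra);
    if (!record)
        return 0;   /* stale index entry: record gone or not visible to us */
    return predicate_matches(predicate, record);   /* believe the data, not the index */
}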

>You mentioned that indexes store a record number. I assume this is really a
>classic record ID structure, something like a 24 bit page number and an 8
>bit index number for the record on the page represented as a 32 bit value.
>

A record number is decomposed into relative page number and line index
on page. The relative page number is decomposed into relative pointer
page number and pointer page slot. The various constants are computed
from the page size.
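
(Schematically, with illustrative constants -- the real divisors are
computed from the page size, not these literals:)

#include <stdio.h>

#define RECORDS_PER_PAGE    60     /* line index slots per data page (illustrative)   */
#define SLOTS_PER_POINTER 1000     /* data page slots per pointer page (illustrative) */

void decompose_record_number(long record_number)
{
    long line          = record_number % RECORDS_PER_PAGE;
    long relative_page = record_number / RECORDS_PER_PAGE;
    long slot          = relative_page % SLOTS_PER_POINTER;
    long pointer_page  = relative_page / SLOTS_PER_POINTER;

    printf("pointer page %ld, slot %ld, line %ld\n", pointer_page, slot, line);
}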

>The index root page presumably has pointers to the root pages of the various
>indexes on a table. How are indexes named on this page? Is it just an
>integer, or does it actually contain an index description?
>

The index root page contains physical information about each index
(page number of its root, number of keys, datatypes of keys, etc.).

>Generator pages seem pretty clear (and not that interesting). Presumably a
>set of <generator_id, generator_value> pairs.
>

So boring I don't remember. Anyone? Anyone?

>How are blobs stored? I assume they are also stored page at a time. It is
>just a linked list of pages, or is random access possible? If random access
>is possible, why isn't that exposed in the API?
>

Blobs are stored in one of three ways depending on size. A blob that
fits on a regular data page is a level 0 blob; it is commingled with
records and, if possible, will be stored on its parent's data page.
Large blobs are stored on dedicated blob pages, with either a list of
blob pages (level 1) or a list of blob pointer pages (level 2)
commingled with records like a level 0 blob.
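
(A rough sketch of how the level might be chosen from the blob's size.
The threshold arithmetic and the 32-byte overhead figure are invented
for illustration, not the engine's actual numbers:)

/* Pick a blob level from its length.  "usable" and the per-page
   overhead are assumptions; the real computation depends on the page
   header and blob header sizes. */
int blob_level(long blob_length, long page_size)
{
    long usable            = page_size - 32;              /* assumed page overhead */
    long pointers_per_page = usable / (long) sizeof(long);

    if (blob_length <= usable)
        return 0;   /* fits on a data page, commingled with records    */
    if (blob_length <= usable * pointers_per_page)
        return 1;   /* one page's worth of blob page numbers suffices  */
    return 2;       /* need blob pointer pages                         */
}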

The engine supports two types of blobs: segmented blobs and stream
blobs. Segmented blobs maintain segment boundaries with embedded
segment lengths. Because of the embedded lengths, direct access
isn't supported for segmented blobs. Stream blobs are just stuff.

gds_seek_blob is what you want. The head of documentation didn't
like me (or anyone else, for that matter) so I don't know why it
didn't get documented. The array code, properly layered on stream
blobs, uses it heavily. Ask Ann, she gets paid to answer these
kinds of questions (I think I'm going to send her a bill for this
one, however).

>That leaves the header page. Presumably values to allow bootstrapping, page
>size, version number, pointers to system tables which describe where the
>tables live, etc.
>

Close. Values to allow bootstrapping, page size, version number. Also
the transaction number, oldest interesting transaction, platform class,
major and minor version number, and the all-important page number of
the PAGES table from which the locations of all other tables are
found (the base system tables have fixed table id's).
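
(Very roughly, the shape of it -- the field names and types here are
mine for illustration, not the on-disk structure:)

typedef struct header_page {
    unsigned short page_size;
    unsigned short ods_major_version;
    unsigned short ods_minor_version;
    unsigned short platform_class;        /* implementation class           */
    long           next_transaction;      /* transaction number             */
    long           oldest_interesting;    /* oldest interesting transaction */
    long           pages_table_page;      /* page number of the PAGES table;
                                             everything else is found from
                                             here                           */
} header_page_t;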

>What have I missed?

The kitchen sink. We have two: Jim's and Ann's (just like database
companies). Both are underutilized because Ann is in lala-land
trying to spring Interbase from the Cruel Oppressors. I hope you
guys appreciate the suffering I endure on your behalf!

>
>>>On the allocation of pages from the OS to the database file, requesting
>>>pages in larger chunks from the operating system can be faster, as well as
>>>helping the underlying filesystem reduce fragmentation. On the level of
>the
>>>database file, I don't think allocating in (say) 64KB or even 256KB chunks
>>>is a big issue.
>>
>
>Windows NT makes this distinction at the API level. By default, files are
>not expected to be sparse, so extending a file will allocate pages. What
>allocation actually does will vary according to the underlying file system.
>NTFS will allocate pages in runs (like extents). Allocating more space at
>once should help NTFS create longer runs.
>

Charlie's problem. Soon, your problem also.

>
>The pointer page seems to serve two purposes. Presumably it stores database
>pages allocated to the table (numbered relative to the database, not the
>table), and whether the page is full or not. Conceptually the list of
>database pages allocated to the table should live outside the table itself,
>but whether a table is full or not should live inside the table.
>

Pointer pages are table specific and store pages assigned to the table.

>I guess this all means that to check the overall database file consistency,
>you have to have some knowledge of table structure. Because allocation is
>page at a time, every database page allocated to a table needs to be in a
>pointer page.
>

Nope again. For a smart fellow you make a lot of good but wrong guesses.

Overflow pages (tails of records that don't fit on a data page), blob
data pages, and blob pointer pages are not recorded on pointer pages.

>How are pointer pages related? Are they in a linked list, or is a list of
>pointer pages for a table stored in a special pointer page?
>

Pointer page numbers are recorded in RDB$PAGES, and the pointer pages
are also chained for redundancy. To find a table, all you really need
is the number of the first pointer page. RDB$PAGES also has the index
root page for a table.
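
(So visiting every data page of a table is just a walk down the chain.
A sketch, with a hypothetical structure and names:)

typedef struct pointer_page {
    long next_pointer_page;    /* zero at the end of the chain         */
    long count;                /* data page slots in use on this page  */
    long data_pages[1];        /* actually fills the rest of the page  */
} pointer_page_t;

extern pointer_page_t *read_page(long page_number);   /* assumed cache fetch */

void walk_table(long first_pointer_page, void (*visit)(long data_page))
{
    long p = first_pointer_page;
    while (p) {
        pointer_page_t *pp = read_page(p);
        for (long i = 0; i < pp->count; i++)
            visit(pp->data_pages[i]);
        p = pp->next_pointer_page;
    }
}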

>What about pages allocated to indexes? Is there a variant of a pointer page
>to see which pages are allocated to an index?
>

Unnecessary. RDB$PAGES points to the index root page for each table.
The index root page points to the root page for each index. Walk
the tree, Luke.

>The summary is that finding a page for a large record in a big table can
>actually be very expensive.
>

Actually not. The engine stores big records tail first on overflow pages
until what's left fits on a page. Because the overflow pages aren't
in the pointer page, the overhead for large records is not much more
than linear in record size. Also, many large records shrink during
record compression (surely you don't think we store padded-out fields,
do you?).
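
(Tail first, roughly like this -- store_overflow_page() and
store_record_head() are invented names for the sketch:)

/* Peel pieces off the end of the record onto overflow pages until the
   head (plus a pointer to the overflow chain) fits on a data page. */
extern long store_overflow_page(const char *data, long length, long next_page);
extern void store_record_head(const char *data, long length, long overflow_chain);

void store_big_record(const char *record, long length, long max_on_data_page)
{
    long chain = 0;                       /* page number heading the overflow chain */
    while (length > max_on_data_page) {
        long piece = length - max_on_data_page;
        if (piece > max_on_data_page)
            piece = max_on_data_page;     /* one overflow page at a time            */
        length -= piece;
        chain = store_overflow_page(record + length, piece, chain);
    }
    store_record_head(record, length, chain);
}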

>An alternative would be separating allocation of pages to objects from
>internal object book keeping. Using extent based allocation you can avoid
>storing each page allocated to an object by number. By separating the
>internal book keeping the information about table page usage becomes much
>more dense. Two bits per database per page could store, say, empty, <50%
>used, >50% used, full.
>
>Of course, my guesses about the pointer page structure could be totally
>wrong, and something like this (or better) might exist.
>

I'm not going to say that the Interbase scheme isn't perfect in all
respects, but I will say that the last time I designed a DBMS I did
it differently.

>
>I can see lots of good reasons to drop the classic architecture. Contention
>during table extension without a shared cache must be pretty ugly.
>

A single code base serving two masters is pretty ugly. I don't like
deferring architectural changes to benefit superserver because they
break classic. Cake, eat: Pick one.

>What I would like (and will do myself when I get the source if necessary) is
>a linkable Interbase so a single user application can run entirely in a
>single process. I'd be just as happy to link a superserver though.
>

Good point.

Jim Starkey