Subject | Re: [IB-Architect] Insert Speed
---|---
Author | Jim Starkey
Post date | 2000-04-09T18:29:02Z
At 11:23 AM 4/9/00 +1000, Jan Mikkelsen wrote:
>All allocations are a page at a time.
>A low water mark for searching is just fine, given that every allocation is
>just for a single page. If the PIP has a free page, the low water mark
>points straight to the free page, otherwise the low water mark says "full".
>I don't think it needs to be made denser on disk, but if allocation is done
>more than a page at a time, something more complex is needed to find a
>sequence of n free pages.
>
>So write ordering is the answer. Presumably the relevant pointer page for
>the table would also need to be written before the page itself. I also
>would guess that the order of the writes must be PIP, pointer page, data
>page.
>
And those would be bad guesses. The overriding principle behind
careful write is that an object is always written before a pointer
to an object so the worst case of a crash is lost space. Assume a
transaction that inserted a single record that required allocation
of an additional data page. Here are the precedence relationships:
1. The data page must be written before the pointer page.
2. The PIP must be written before the data page.
3. The data page, the pointer page, and the PIP must all be
written before the Transaction Inventory Page (TIP) indicating
that the transaction has committed.
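As a sketch only, here is that ordering in code; the page structures and the write_page() helper are invented stand-ins for the engine's cache writer, not engine source:

```c
/* Hedged sketch: enforcing the careful-write order for a single-record
 * insert that allocated a new data page.  All names are illustrative. */
#include <stdio.h>

typedef struct { int page_number; const char *kind; } page_t;

static void write_page(const page_t *p)
{
    /* A real engine would write p->page_number through to disk here
     * and wait for completion before returning. */
    printf("writing %s page %d\n", p->kind, p->page_number);
}

int main(void)
{
    page_t pip     = { 2,   "PIP" };      /* page inventory page   */
    page_t data    = { 412, "data" };     /* newly allocated page  */
    page_t pointer = { 38,  "pointer" };  /* table's pointer page  */
    page_t tip     = { 3,   "TIP" };      /* transaction inventory */

    write_page(&pip);      /* rule 2: PIP before the data page          */
    write_page(&data);     /* rule 1: data page before the pointer page */
    write_page(&pointer);
    write_page(&tip);      /* rule 3: everything before the TIP marks
                              the transaction committed                 */
    return 0;
}
```

After a crash at any point in this sequence, the worst outcome is a page marked allocated that nothing points to: lost space, never a dangling pointer.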
>What happens if the PIP is written, but a system crash stops the pointer
>page from being written? Is the page lost to future allocations, or does
>the database startup procedure check for lost pages?
>
The database validation function will find lost pages. There isn't
much to be gained by starting an extremely expensive validation
pass at startup time and much to be lost. A few lost pages don't
have much effect on the world.
>What are the recovery procedures to check database file integrity on
>startup?
>
By design, none are needed.
>On a slightly tangential issue: I assume that Interbase opens files with
>O_SYNC on Unix, or FILE_FLAG_WRITE_THROUGH on Win32.
>
I'm not sure of the current state. It used to be an option. When
it first showed up on SunOS it was such an infinite dog that making
it mandatory seemed like cruel and unusual punishment for users.
Interestingly enough, given the pattern of writes (generally forced
by a commit verb), the Unix cache rarely screws things up, and gives
a beneficial second level cache.
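For reference, a minimal sketch of what the Unix side of that option toggles; the function and its flag handling are illustrative, not InterBase source:

```c
/* Hedged sketch: opening a database file with or without synchronous
 * (forced) writes on Unix.  Illustrative only. */
#include <fcntl.h>
#include <stdio.h>

int open_database(const char *path, int forced_writes)
{
    int flags = O_RDWR;
    if (forced_writes)
        flags |= O_SYNC;   /* every write() blocks until it hits disk */

    int fd = open(path, flags);
    if (fd < 0)
        perror("open");
    /* Without O_SYNC the Unix page cache absorbs the writes and acts as
     * the "beneficial second level cache" described above; ordering is
     * then up to the cache, which, given commit-forced write patterns,
     * rarely misbehaves. */
    return fd;
}
```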
>By careful write, I assume you mean ordered writes to ensure integrity. The
>risks of abandoning ordered writes in this context must be pretty severe.
>
Merely catastrophic. I just wanted to see the degree of overhead
so I could find something else to worry about. I don't believe in
thinking about performance; I want numbers. I prefer to attack real
hot spots.
>(Incidentally, I have also heard the term "careful write" applied to
>mirrored disks: Ensuring that one side of the mirror is successfully
>written before starting the second write, so that you either get the old or
>the new page contents after a system crash, rather than two copies of a
>compromised page. An expensive technique for the totally paranoid. Just a
>minor terminology difference.)
>
The term 'careful write' was in use long before the idea of mirrored
disks popped into somebody's head. The file systems used on DEC's
RSX and RSX+2 (aka VMS) were the first use that I am aware of.
>I understand PIPs, TIPs and data pages. Pointer pages are later in this
>message.
>
>I assume index pages are pretty much classic b-tree indexes. Interior nodes
>ultimately pointing to leaf pages containing an ordered list of key value
>and record identifier pairs. The leaf pages are probably more complex than
>the classic structure to handle isolation and recovery.
>
Ah, sir, wrong again! They are less complex because they don't worry
about rigid consistency between index and records for a number of
reasons. First, the multigenerational architecture requires that the
index (transiently, admittedly) reflect multiple keys per logical
record. Second, there is no hope of synchronizing index pages with
data and pointer pages. Third, the index is imprecise -- all numbers
and dates are converted to double precision floating point during the key
generation process. The easy solution is for the engine to treat
the indexes as noisy. Before it believes the index, it fetches the
appropriate version of the record and applies the appropriate predicate
to make sure it is a keeper.
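A sketch of that recheck under stated assumptions; the record type, fetch callback, and predicate callback below are invented stand-ins for the real internals:

```c
/* Hedged sketch: treating the index as "noisy".  Every index hit is
 * confirmed against the actual record version before it is kept. */
#include <stdbool.h>
#include <stddef.h>

typedef struct record record_t;          /* opaque record version     */
typedef record_t *(*fetch_fn)(long record_number, long transaction);
typedef bool (*pred_fn)(const record_t *rec);

/* Keep only index hits whose fetched record version really satisfies
 * the predicate; the index alone is never trusted. */
size_t filter_index_hits(const long *hits, size_t n_hits, long transaction,
                         fetch_fn fetch_version, pred_fn predicate,
                         record_t **out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n_hits; i++) {
        /* The index may carry stale or doubly-mapped keys, and numeric
         * keys were squashed to doubles, so this fetch is mandatory. */
        record_t *rec = fetch_version(hits[i], transaction);
        if (rec != NULL && predicate(rec))
            out[kept++] = rec;
    }
    return kept;
}
```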
>You mentioned that indexes store a record number. I assume this is really a
>classic record ID structure, something like a 24 bit page number and an 8
>bit index number for the record on the page represented as a 32 bit value.
>
A record number is decomposed into relative page number and line index
on page. The relative page number is decomposed into relative pointer
page number and pointer page slot. The various constants are computed
from the page size.
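A sketch of the arithmetic, assuming invented per-page constants; the real divisors fall out of the page size:

```c
/* Hedged sketch: decomposing a record number.  The two divisors are
 * placeholders; the engine derives them from the page size. */
#include <stdio.h>

#define RECORDS_PER_PAGE        60   /* line index slots per data page */
#define PAGES_PER_POINTER_PAGE 250   /* data page slots per pointer pg */

int main(void)
{
    long record_number = 123456;

    long rel_page   = record_number / RECORDS_PER_PAGE;
    long line_index = record_number % RECORDS_PER_PAGE;

    long pp_number  = rel_page / PAGES_PER_POINTER_PAGE; /* which pointer page */
    long pp_slot    = rel_page % PAGES_PER_POINTER_PAGE; /* slot on that page  */

    printf("record %ld -> pointer page %ld, slot %ld, line %ld\n",
           record_number, pp_number, pp_slot, line_index);
    return 0;
}
```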
>The index root page presumably has pointers to the root pages of the various
>indexes on a table. How are indexes named on this page? Is it just an
>integer, or does it actually contain an index description?
>
The index root page contains physical information about the index
(page number of root, number of keys, datatypes of keys, etc).
>Generator pages seem pretty clear (and not that interesting). Presumably a
>set of <generator_id, generator_value> pairs.
>
So boring I don't remember. Anyone? Anyone?
>How are blobs stored? I assume they are also stored page at a time. Is it
>just a linked list of pages, or is random access possible? If random access
>is possible, why isn't that exposed in the API?
>
Blobs are stored in one of three ways depending on size. A blob that fits
on a regular data page is a level 0 blob, is commingled with records, and
if possible, will be stored on its parent's data page. Large blobs are
stored on dedicated blob pages with either a list of blob pages (level 1)
or a list of blob pointer pages (level 2) commingled with records as
per a level 0 blob.
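A sketch of the size-based choice, with made-up thresholds; the real cutoffs are derived from the page size:

```c
/* Hedged sketch: picking a blob storage level by size.  The thresholds
 * below are placeholders, not the engine's actual cutoffs. */
typedef enum { BLOB_LEVEL_0, BLOB_LEVEL_1, BLOB_LEVEL_2 } blob_level_t;

blob_level_t choose_blob_level(long blob_bytes, long page_size)
{
    long level0_max = page_size / 4;               /* fits beside records     */
    long level1_max = (page_size / 4) * page_size; /* one page of page numbers */

    if (blob_bytes <= level0_max)
        return BLOB_LEVEL_0;  /* stored on the parent's data page      */
    if (blob_bytes <= level1_max)
        return BLOB_LEVEL_1;  /* list of blob pages, kept with records */
    return BLOB_LEVEL_2;      /* list of blob pointer pages            */
}
```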
The engine supports two types of blobs: segmented blobs and stream
blobs. Segmented blobs maintain segment boundaries with embedded
segment lengths. Because of the embedded lengths, direct access
isn't supported for segmented blobs. Stream blobs are just stuff.
gds_seek_blob is what you want. The head of documentation didn't
like me (or anyone else, for that matter) so I don't know why it
didn't get documented. The array code, properly layered on stream
blobs, uses it heavily. Ask Ann, she gets paid to answer these
kind of questions (I think I'm going to send her a bill for this
one, however).
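A hedged usage sketch: gds_seek_blob survives as isc_seek_blob in later ibase.h headers, and both the signature and the mode constant below are my recollection rather than gospel, so verify against your header:

```c
/* Hedged sketch: random access on a stream blob via the seek call
 * (gds_seek_blob, later isc_seek_blob).  Verify signature and mode
 * values against your ibase.h before relying on this. */
#include <ibase.h>

long seek_stream_blob(isc_blob_handle *blob, long offset)
{
    ISC_STATUS status[20];
    ISC_LONG   result = 0;

    /* mode 0 = from start; 1 = from current; 2 = from end (assumed). */
    if (isc_seek_blob(status, blob, 0, (ISC_LONG) offset, &result))
        return -1;                /* status vector holds the error */
    return (long) result;         /* absolute position after seek  */
}
```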
>That leaves the header page. Presumably values to allow bootstrapping, page
>size, version number, pointers to system tables which describe where the
>tables live, etc.
>
Close. Values to allow bootstrapping, page size, version number. Also the
transaction number, oldest interesting transaction, platform class,
major and minor version number, and the all important page number of
the PAGES table from which the locations of all other tables are
found (the base system tables have fixed table id's).
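Purely as a reading aid, those fields as a struct sketch; names, widths, and order are invented, not the on-disk layout:

```c
/* Hedged sketch: the header-page fields named above, as a C struct.
 * Field names, widths, and order are illustrative only. */
#include <stdint.h>

typedef struct {
    uint16_t page_size;          /* database page size in bytes       */
    uint16_t major_version;      /* on-disk structure major version   */
    uint16_t minor_version;      /* on-disk structure minor version   */
    uint16_t platform_class;     /* platform the file was created on  */
    uint32_t next_transaction;   /* next transaction number to assign */
    uint32_t oldest_interesting; /* oldest interesting transaction    */
    uint32_t pages_table_page;   /* page number of the PAGES table,
                                    from which all other tables are
                                    found (system table ids are fixed) */
} header_page_sketch_t;
```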
>What have I missed?
>
The kitchen sink. We have two: Jim's and Ann's (just like database
companies). Both are underutilized because Ann is in lala-land
trying to spring Interbase from the Cruel Oppressors. I hope you
guys appreciate the suffering I endure on your behalf!
>>>On the allocation of pages from the OS to the database file, requesting
>>>pages in larger chunks from the operating system can be faster, as well as
>>>helping the underlying filesystem reduce fragmentation. On the level of
>>>the
>>>database file, I don't think allocating in (say) 64KB or even 256KB chunks
>>>is a big issue.
>>
>
>Windows NT makes this distinction at the API level. By default, files are
>not expected to be sparse, so extending a file will allocate pages. What
>allocation actually does will vary according to the underlying file system.
>NTFS will allocate pages in runs (like extents). Allocating more space at
>once should help NTFS create longer runs.
>
Charlie's problem. Soon, your problem also.
>Pointer pages are table specific and store pages assigned to the table.
>The pointer page seems to serve two purposes. Presumably it stores database
>pages allocated to the table (numbered relative to the database, not the
>table), and whether the page is full or not. Conceptually the list of
>database pages allocated to the table should live outside the table itself,
>but whether a table is full or not should live inside the table.
>
>I guess this all means that to check the overall database file consistency,
>you have to have some knowledge of table structure. Because allocation is
>page at a time, every database page allocated to a table needs to be in a
>pointer page.
>
Nope again. For a smart fellow you make a lot of good but wrong guesses.
Overflow pages (tails of records that don't fit on a data page), blob
data pages, and blob pointer pages are not recorded on pointer pages.
>How are pointer pages related? Are they in a linked list, or is a list of
>pointer pages for a table stored in a special pointer page?
>
Pointer page numbers are recorded in RDB$PAGES and are also chained
for redundancy. To find a table, all you really need is the number
of the first pointer page. RDB$PAGES also holds the index root page
for a table.
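A runnable toy of that walk, with stub pages standing in for disk reads and an assumed first pointer page obtained from RDB$PAGES:

```c
/* Hedged sketch: walking a table's pointer-page chain.  Sizes and the
 * stub reader are toys; real slot counts come from the page size. */
#include <stddef.h>
#include <stdio.h>

#define PP_SLOTS 4  /* tiny for the demo */

typedef struct pointer_page {
    long page_numbers[PP_SLOTS]; /* data pages owned by the table */
    long next_pointer_page;      /* chain for redundancy; 0 = end */
} pointer_page_t;

/* A toy two-page chain standing in for pages read from disk. */
static pointer_page_t pp2 = { { 210, 211, 0, 0 }, 0 };
static pointer_page_t pp1 = { { 200, 201, 202, 203 }, 2 };

static pointer_page_t *read_page(long page_number)  /* stub reader */
{
    return page_number == 1 ? &pp1 : &pp2;
}

int main(void)
{
    /* RDB$PAGES would give us the first pointer page; assume page 1. */
    long pp_num = 1;
    while (pp_num != 0) {
        pointer_page_t *pp = read_page(pp_num);
        for (size_t i = 0; i < PP_SLOTS; i++)
            if (pp->page_numbers[i] != 0)
                printf("data page %ld\n", pp->page_numbers[i]);
        pp_num = pp->next_pointer_page;
    }
    return 0;
}
```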
>What about pages allocated to indexes? Is there a variant of a pointer page
>to see which pages are allocated to an index?
>
Unnecessary. RDB$PAGES points to the index root page for each table.
The index root page points to the root page for each index. Walk
the tree, Luke.
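To make "walk the tree" concrete, a descent sketch over an invented node layout; real index buckets are denser and carry compressed keys:

```c
/* Hedged sketch: descending from an index root to its leaf level.
 * The node layout is invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

typedef struct node {
    bool         is_leaf;
    size_t       n_keys;
    double       keys[64];      /* keys squashed to doubles, as above */
    struct node *children[65];  /* interior: child per key interval   */
    long         records[64];   /* leaf: record number per key        */
} node_t;

/* Find a record number for a key, or -1 if absent.  The caller still
 * rechecks the fetched record, since the index is "noisy". */
long index_lookup(const node_t *root, double key)
{
    const node_t *n = root;
    while (!n->is_leaf) {
        size_t i = 0;
        while (i < n->n_keys && key >= n->keys[i])
            i++;                /* pick the child interval */
        n = n->children[i];
    }
    for (size_t i = 0; i < n->n_keys; i++)
        if (n->keys[i] == key)
            return n->records[i];
    return -1;
}
```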
>The summary is that finding a page for a large record in a big table can
>actually be very expensive.
>
Actually not. The engine stores big records tail first on overflow pages
until what's left fits on a page. Because the overflow pages aren't
in the pointer page, the overhead for large records is not much more
than linear on record size. Also many large records shrink during
record compression (surely you don't think we store padded-out fields,
do you?).
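A toy sketch of tail-first storage; the page capacities are placeholders:

```c
/* Hedged sketch: storing a big record tail first.  Pieces are peeled
 * off the end onto overflow pages until the head fits on a normal
 * data page.  Sizes are toys, not real page capacities. */
#include <stdio.h>

#define DATA_PAGE_ROOM 1000   /* usable bytes on a data page (toy) */
#define OVERFLOW_ROOM  1100   /* usable bytes on an overflow page  */

int main(void)
{
    long record_bytes   = 5300;  /* compressed record size */
    int  overflow_pages = 0;

    while (record_bytes > DATA_PAGE_ROOM) {
        record_bytes -= OVERFLOW_ROOM;  /* peel the tail off */
        overflow_pages++;
    }
    /* Overflow pages never appear on pointer pages, so the bookkeeping
     * cost stays roughly linear in record size. */
    printf("head of %ld bytes on a data page, %d overflow page(s)\n",
           record_bytes, overflow_pages);
    return 0;
}
```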
>An alternative would be separating allocation of pages to objects from
>internal object book keeping. Using extent based allocation you can avoid
>storing each page allocated to an object by number. By separating the
>internal book keeping the information about table page usage becomes much
>more dense. Two bits per database page could store, say, empty, <50%
>used, >50% used, full.
>
>Of course, my guesses about the pointer page structure could be totally
>wrong, and something like this (or better) might exist.
>
I'm not going to say that the Interbase scheme isn't perfect in all
respects, but I will say that the last time I designed a DBMS I did
it differently.
>I can see lots of good reasons to drop the classic architecture. Contention
>during table extension without a shared cache must be pretty ugly.
>
A single code base serving two masters is pretty ugly. I don't like
deferring architectural changes to benefit superserver because they
break classic. Cake, eat: Pick one.
>What I would like (and will do myself when I get the source if necessary) is
>a linkable Interbase so a single user application can run entirely in a
>single process. I'd be just as happy to link a superserver though.
>
Good point.
Jim Starkey