Subject Re: [IB-Architect] Insert Speed
Author Jim Starkey
Boy, you ask a lot of questions. On the other hand, you are
obviating the need to write a Child's Guide to the Interbase
Engine.

At 10:51 PM 4/10/00 +1000, Jan Mikkelsen wrote:
>Jim Starkey <jas@...> wrote:
>
>>At 11:23 AM 4/9/00 +1000, Jan Mikkelsen wrote:
>>>On a slightly tangential issue: I assume that Interbase opens files with
>>>O_SYNC on Unix, or FILE_FLAG_WRITE_THROUGH on Win32.
>>
>>I'm not sure of the current state. It used to be an option. When
>>it first showed up on SunOS it was such an infinite dog that making
>>it mandatory seemed like cruel and unusual punishment for users.
>
>
>Not using those options is basically an invitation to the operating system
>to reorder your writes. If those (well, O_SYNC, anyway) options weren't
>used when you did your 3% measurement, all you were measuring was the
>internal IB overhead of managing the precedence relationships.
>

The comparison was among non-synced IO, O_SYNC, and a raw
file system (where Interbase runs just fine, thank you). I vaguely
remember that sync-ed IO on SunOS was significantly slower than
raw IO, which is always sync-ed. But it was a long time ago. If
somebody cares, they could easily measure the difference.
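
For reference, the flags in question are the ordinary OS ones. A minimal
sketch of asking for write-through at open time (illustrative only; as
noted, it was an option, and the details are platform dependent):

#ifdef _WIN32
#include <windows.h>

/* FILE_FLAG_WRITE_THROUGH asks Win32 not to complete a write until it
   has been pushed toward the device. */
HANDLE open_db(const char *path)
{
    return CreateFileA(path, GENERIC_READ | GENERIC_WRITE,
                       FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                       OPEN_EXISTING, FILE_FLAG_WRITE_THROUGH, NULL);
}
#else
#include <fcntl.h>

/* O_SYNC makes each write(2) synchronous, so the OS can't quietly
   reorder the careful-write sequence. */
int open_db(const char *path)
{
    return open(path, O_RDWR | O_SYNC);
}
#endif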

>
>In superserver, I can't see any benefit from a second level cache. It just
>introduces lots of memory copies, although it does compensate for an
>underconfigured IB cache. Memory copies are still cheaper than disk I/O.
>In classic, a second level cache is essential.
>

I think you are correct in both cases. Superserver doesn't need it,
and a system cache lets classic pass a page to another process without
a physical read.

>[ On index structure ]
>

I hate 'em. To work well, you have to know your cardinality. Guess too
low and you sprinkle your records around the disk. Guess too high, and
you chase overflow areas. Guess just right, and it's almost as good
as Interbase's worst case.

>
>Are there any levels of indirection for sufficiently large blobs, or does
>seeking to the end of a sufficiently large blob involve walking a list of
>blob pointer pages?
>

A seek on a level 2 blob involves an index into a vector of blob pointer
page numbers, a fetch and index into a blob pointer page, and the
fetch of the blob data page. Everything is dense. Hard to be
much better, particularly since we apparently never told anybody
that you could seek into a blob in the first place.
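
A sketch of the arithmetic, with an invented page size and invented
names (the real constants depend on the database's page size):

#define PAGE_SIZE      4096
#define PTRS_PER_PAGE  (PAGE_SIZE / (long) sizeof (long)) /* slots on a blob pointer page */
#define BYTES_PER_PAGE PAGE_SIZE                          /* payload of a blob data page */

/* Given a byte offset into a level 2 blob, compute where to look. */
void blob_seek_path(long offset,
                    long *vector_slot,   /* index into the vector of pointer page numbers */
                    long *pointer_slot,  /* index into that blob pointer page */
                    long *byte_in_page)  /* offset within the blob data page */
{
    long data_page = offset / BYTES_PER_PAGE;

    *vector_slot  = data_page / PTRS_PER_PAGE;
    *pointer_slot = data_page % PTRS_PER_PAGE;
    *byte_in_page = offset % BYTES_PER_PAGE;
}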

>
>You haven't mentioned blob pages as a page type. I assume they belong
>(conceptually) to the table, and each page belongs to a particular blob.
>Ie: There is no partial allocation of blob pages. I don't see how that
>would make sense given that small blobs are stored inline with the record.
>

Blob pointer pages and blob data pages are the same structure, but
one is a vector of longs and the other a vector of char.
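
Roughly (the layout below is a cartoon, not the ODS declaration):

typedef struct {
    /* ... standard page header ... */
    union {
        long page_numbers[1];   /* blob pointer page: vector of longs */
        char bytes[1];          /* blob data page: vector of char */
    } payload;
} blob_page;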

>
>Why do segmented blobs even exist? Was it the beginning of creating RMS
>like record structures on top of blobs?
>

Very good guess. Nailed it first time. Seemed like a good idea at
the time.

>I expect that the underlying implementation of segmented vs. stream blobs
>is the same, with the engine just interpreting the contents of a stream
>blob to provide a segmented blob service. If not, what is the difference,
>and why?
>

Yup. The other levels care; the engine, not even remotely. The engine
(other than the upper level of the actual blob calls) is completely
segment unaware.
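
One plausible way an upper layer could do that interpretation is to store
each segment as a length word followed by its bytes; the encoding below
is an assumption for illustration, not the actual format:

#include <string.h>

typedef struct {
    const char *data;    /* contents of the underlying stream blob */
    long length;
    long pos;            /* current read position */
} stream_blob;

/* Return the next segment's length, or -1 at end of blob. */
long get_next_segment(stream_blob *blob, char *buffer, long buffer_size)
{
    unsigned short seg_len;

    if (blob->pos + (long) sizeof seg_len > blob->length)
        return -1;
    memcpy(&seg_len, blob->data + blob->pos, sizeof seg_len);
    blob->pos += sizeof seg_len;

    if (seg_len > buffer_size || blob->pos + seg_len > blob->length)
        return -1;
    memcpy(buffer, blob->data + blob->pos, seg_len);
    blob->pos += seg_len;
    return seg_len;
}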

>
>
>Walking the tree works fine as long as you don't lose a page. What is
>stored in a standard page header? Presumably there is a page type, and a
>page checksum/CRC or a flag at the start and end of the page to detect
>partial writes. Is there also an identifier for the database level object
>which owns the page for use by gfix? Relation ID for table pages? Relative
>page number within the database object?
>

We've been around and around and around on this issue. I originally
checksummed pages. Deej and Charlie (rightly) decided that this was
too expensive. RMS used to write a page version at the front and
back on the mistaken belief that disk controllers write blocks in
order (not true of HSC; don't know about current stuff). The RMS
guys now work for Microsoft on SQL/Server which (according to rumor)
now writes page version at the front and back.

Unless somebody can come up with a better scheme, a page version
number at each end is probably the best compromise, even if that's
what SQL/Server does.
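
The trick itself is simple enough; a sketch, with invented field names:

#define PAGE_SIZE 4096

typedef struct {
    unsigned long head_version;   /* stamped at the front of the page */
    char body[PAGE_SIZE - 2 * sizeof (unsigned long)];
    unsigned long tail_version;   /* stamped at the back of the page */
} versioned_page;

/* Before writing, stamp the same version at both ends. */
void stamp_page(versioned_page *page, unsigned long version)
{
    page->head_version = version;
    page->tail_version = version;
}

/* After reading, a mismatch means the write only partially completed. */
int page_is_torn(const versioned_page *page)
{
    return page->head_version != page->tail_version;
}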

Of course pointer and data pages have a table id and relative number.
The whole ODS is chock full of enough redundant information that
a skilled veterinarian could turn hamburger back into a cow. Ann
was going to write the cow reconstruction program but never got
around to it.
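
By way of illustration (field names invented), the kind of identification
a page carries that makes that sort of reconstruction possible:

typedef struct {
    unsigned char page_type;       /* data page, pointer page, index page, ... */
    unsigned short relation_id;    /* which table owns the page */
    unsigned long sequence;        /* relative page number within that table */
} page_identity;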

>I guess indexes are expendable, and recovery could also work by a process of
>elimination. That is more difficult for blobs and overflow pages.
>
>>>The summary is that finding a page for a large record in a big table can
>>>actually be very expensive.
>>
>>Actually not. The engine stores big records tail first on overflow pages
>>until what's left fits on a page. Because the overflow pages aren't
>>in the pointer page, the overhead for large records is not much more
>>than linear on record size. Also many large records shrink during
>>record compression (sure you don't think we store padded out fields,
>>do you?).
>
>
>I didn't mean records larger than the page size, I just meant records larger
>than the amount of free space on the first candidate page for storing the
>record.
>

Records that are bigger than a data page are special cased with overflow
pages that don't show up in the pointer page. For an initial insert, the
engine finds space for the record without fragmenting it. Yes, this is
wasteful of space in the worst case, but the alternatives aren't pretty.
A record update can be fragmented if the new version doesn't fit, because
the record header can't be moved and still maintain its record number.
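
A sketch of the tail-first split (the constants and helper routines are
invented; only the shape of the algorithm is from the discussion above):

#define PAGE_SIZE     4096
#define PAGE_OVERHEAD 64                        /* hypothetical header/index overhead */
#define USABLE_SPACE  (PAGE_SIZE - PAGE_OVERHEAD)

/* Peel chunks off the tail of a compressed record onto overflow pages
   (which never appear in the pointer page) until the head fits on an
   ordinary data page.  The two callbacks stand in for the real page
   allocation and storage routines. */
long store_big_record(const char *record, long length,
                      long (*put_overflow_chunk)(const char *chunk, long len,
                                                 long next_overflow_page),
                      long (*put_record_head)(const char *head, long len,
                                              long first_overflow_page))
{
    long overflow_chain = 0;                    /* 0 means no overflow pages yet */

    while (length > USABLE_SPACE) {
        length -= USABLE_SPACE;
        overflow_chain = put_overflow_chunk(record + length, USABLE_SPACE,
                                            overflow_chain);
    }

    /* The head, plus a pointer to the overflow chain, goes on a data
       page that is listed in the pointer page. */
    return put_record_head(record, length, overflow_chain);
}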

>Surely this approach must lead to more overflow pages than necessary. An
>overflow page seems to be dedicated to a single record. Is that true? If
>not, what is the structure for finding overflow pages with free space?
>

No, it leads to unused space. Got a better idea that doesn't create
hot spots?

>It sounds like the algorithm for finding a page for a new record finds the
>first page in the table with enough space to at least hold a pointer to an
>overflow page, and then an overflow page if the new record doesn't fit on
>the page.
>
>I've asked this a few times, but I haven't seen an answer: Has anyone ever
>measured the level of fragmentation inside Interbase database files?
>

I used to do it all the time. Divining entrails is a very good way
to learn something.

Jim Starkey