firebird-architect - Re: [IB-Architect] Insert Speed

Subject	Re: [IB-Architect] Insert Speed
Author	Jan Mikkelsen
Post date	2000-04-10T12:51:04Z

Jim Starkey <jas@...> wrote:

>At 11:23 AM 4/9/00 +1000, Jan Mikkelsen wrote:
>>On a slightly tangential issue: I assume that Interbase opens files with
>>O_SYNC on Unix, or FILE_FLAG_WRITE_THROUGH on Win32.
>
>I'm not sure of the current state. It used to be an option. When
>it first showed up on SunOS it was such an infinite dog that making
>it mandatory seemed like cruel and unusual punishment for users.

Not using those options is basically in invitation to the operating system
to reorder your writes. If those (well, O_SYNC, anyway) options weren't
used when you did your 3% measurement, all you were measuring was the
internal IB overhead of managing the precedence relationships.

>Interestingly enough, given the pattern of writes (generally forced
>by a commit verb), the Unix cache rarely screws things up, and gives
>a beneficial second level cache.

How did you measure this? I think that is likely to just be luck:

- If the database file is memory mapped, you can call mmap and msync, or
their equivalents, to initiate a page flush. But I don't see how you can
prevent a the operating system from flushing a page and ignoring the
precedence graph. You also have the problem of using a database page size
smaller than the system page size causing adjacent pages in the database
file to be flushed. But I doubt IB uses memory mapped files (although my
guess record hasn't been perfect so far), because portability to machines of
the appropriate vintage would be difficult.

- You could call fsync or fdatasync, which writes all dirty pages for a
given file, abandoning any attempt at write ordering. That is also pretty
expensive.

In superserver, I can't see any benefit from a second level cache. It just
introduces lots of memory copies, although it does compensate for an
underconfigured IB cache. Memory copies are still cheaper than disk I/O.
In classic, a second level cache is essential.

[ On index structure ]

No wonder clustered indexes aren't on your list of desirable features.

>>How are blobs stored? I assume they are also stored page at a time. It

>>just a linked list of pages, or is random access possible? If random

access

>>is possible, why isn't that exposed in the API?
>
>Blobs are stored in one of three ways depending on size. Blob that fits
>on a regular data page is a level 0 blob, is comingled with records, and
>if possible, will be stored on its parent's data page. Large blobs are
>stored on dedicated blob pages with either a list of blob pages (level 1)
>or a list of blob pointer pages (level 2) comingled with records as
>per a level 0 blob.

Are there any levels of indirection for sufficiently large blobs, or does
seeking to the end of a sufficiently large blob involve walking a list of
blob pointer pages?

You haven't mentioned blob pages are a page type. I assume they belong
(conceptually) to the table, and each page is belongs to a particular blob.
Ie: There is no partial allocation of blob pages. I don't see how that
would make sense given that small blobs are stored inline with the record.

>The engine supports two types of blobs: segmented blobs and stream
>blobs. Segmented blobs maintain segment boundaries with embedded
>segment lengths. Because of the embedded lengths, direct access
>isn't supported for segmented blob. Stream blobs are just stuff.

Why do segmented blobs even exist? Was it the beginning of creating RMS
like record structures on top of blobs?

I expect that the underlying implementation of segemented vs. stream blobs
is be the same, with the engine just interpreting the contents of a stream
blob to provide a segemented blob service. If not, what is the difference,
and why?

>gds_seek_blob is what you want. The head of documentation didn't
>like me (or anyone else, for that matter) so I don't know why it
>didn't get documented. The array code, properly layered on stream
>blobs, uses it heavily. Ask Ann, she gets paid to answer these
>kind of questions (I think I'm going to send her a bill for this
>one, however).

I guess I just have to wait of the source.

[ On operating system file extension ]

>Charlie's problem. Soon, your problem also.

I can see how it will be my problem, but who's Charlie?

>>I guess this all means that to check the overall database file

consistency,

>>you have to have some knowledge of table structure. Because allocation is
>>page at a time, every database page allocated to a table needs to be in a
>>pointer page.
>
>Nope again. For a smart fellow you make a lot of good but wrong guesses.

I don't mind making wrong guesses on the way to understanding. Besides, it
makes things interesting.

>>What about pages allocated to indexes? Is there a variant of a pointer

page

>>to see which pages are allocated to an index?
>
>Unnecessary. RDB$PAGES point to the index root page for each table.
>The index root page points to the root page for each index. Walk
>the tree, Luke.

Walking the tree works fine as long as you don't loose a page. What is
stored in a standard page header? Presumably there is a page type, and a
page checksum/CRC or a flag at the start and end of the page to detect
partial writes. Is there also an identifier for the database level object
which owns the page for use by gfix? Relation ID for table pages? Relative
page number within the database object?

I guess indexes are expendable, and recovery could also work by a process of
elimination. That is more difficult for blobs and overflow pages.

>>The summary is that finding a page for a large record in a big table can
>>actually be very expensive.
>
>Actually not. The engine stores big records tail first on overflow pages
>until what's left fits on a page. Because the overflow pages aren't
>in the pointer page, the overhead for large records is not much more
>than linear on record size. Also many large records shrink during
>record compression (sure you don't think we store padded out fields,
>do you?).

I didn't mean records larger than the page size, I just meant records larger
than the amount of free space on the first candidate page for storing the
record.

Surely this approach must lead to more overflow pages than necessary. An
overflow page seems to be dedicated to a single record. Is that true? If
not, what is the structure for finding overflow pages with free space?

It sounds like the algorithm for finding a page for a new record finds the
first page in the table with enough space to at least hold a pointer to an
overflow page, and then an overflow page is the new record doesn't fit on
the page.

I've asked this a few times, but I haven't seen an answer: Has anyone ever
measured the level of fragmentation inside Interbase database files?

Jan Mikkelsen
janm@...