Subject Re: [Firebird-Architect] UTF-8 over UTF-16 WAS: Applications of Encoded Data Streams
Author Ann W. Harrison
David Johnson wrote:
> With a multibyte character set, whether UTF-8 or UTF-16, we need to get
> away from the idea of allocating maximum storage in bytes for the number
> of characters. A string allocation needs to be exactly as long as it
> needs to be, and no more. Byte size and character size need to be
> understood and treated as distinct measures that may not be directly
> related. i.e. length(x) and sizeof(x) may be distinctly different, and
> should be treated as independent values.

Firebird currently carries both a byte length (RDB$FIELD_LENGTH) and a
character length (RDB$CHARACTER_LENGTH)*. It allocates fields at the
maximum number of bytes that the character length could possibly
require. That sounds like a lot of wasted space, but the trailing part
of the field is compressed before being written to a database page. It
does waste memory space, but Firebird is rarely criticized as a memory pig.
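
To make the trade-off concrete, here is a rough sketch in Python. It is not
Firebird's code or its on-disk format: the function names, the 4-bytes-per-
character worst case, the space padding, and the toy run-length scheme are
all assumptions chosen for illustration. It shows a field allocated at its
worst-case byte size and how the padded tail collapses into a single run.

```python
from itertools import groupby

# Assumption: UTF-8, where one code point needs at most 4 bytes.
MAX_BYTES_PER_CHAR = 4


def allocate_field(text: str, char_length: int) -> bytes:
    """Encode text and pad it to the worst-case byte size for char_length."""
    if len(text) > char_length:
        raise ValueError("text exceeds the declared character length")
    encoded = text.encode("utf-8")
    byte_size = char_length * MAX_BYTES_PER_CHAR
    # Assumption: pad with spaces; the real padding byte may differ.
    return encoded + b" " * (byte_size - len(encoded))


def rle_runs(data: bytes) -> list:
    """Toy run-length encoding: a list of (run_length, byte_value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(data)]


field = allocate_field("h\u00e9llo", 20)  # 5 characters, 6 bytes in UTF-8
runs = rle_runs(field)
print(len(field), "bytes in memory,", len(runs), "runs after compression")
```

The character count, the encoded byte count, and the allocated byte count
are three different numbers here, which is the distinction David is asking
for; the compression step is what keeps the third one from costing much on
disk.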

> In my not so humble and often underinformed opinion, when allocating
> space, you need to allocate the byte size of the string.

Jim's proposal for data stream encoding would solve the problem and
allow the definition of fields as varchar without a specific length.

> You may need
> to be prepared to split pages on different boundaries than you do today,
> and re-org pages if they get fragmented.

Err.. That's not the way it works. Records can be fragmented across
pages, but pages are fixed length. They don't fragment.

> After an insert or update, re-orging a single page that you have already
> read into memory is relatively cheap, since you will have to rewrite the
> entire page after modification anyways.

And Firebird currently does that - not every time a page is read, but in
the case where a page has space for the record that's about to be stored
but not contiguous space.
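
A minimal sketch of that case, in Python. The Page class, its method names,
and the offset-based layout are hypothetical and much simpler than a real
data page; the only point is that compaction happens when the total free
space is sufficient but no single gap is.

```python
class Page:
    """Hypothetical fixed-size page holding (offset, data) records."""

    def __init__(self, size: int):
        self.size = size
        self.records = []  # list of (offset, bytes); offsets never overlap

    def free_total(self) -> int:
        return self.size - sum(len(data) for _, data in self.records)

    def largest_gap(self) -> int:
        gaps, cursor = [], 0
        for offset, data in sorted(self.records):
            gaps.append(offset - cursor)
            cursor = offset + len(data)
        gaps.append(self.size - cursor)  # space after the last record
        return max(gaps)

    def compact(self) -> None:
        """Slide every record toward the start so free space is contiguous."""
        cursor, packed = 0, []
        for _, data in sorted(self.records):
            packed.append((cursor, data))
            cursor += len(data)
        self.records = packed

    def store(self, data: bytes) -> bool:
        if len(data) > self.free_total():
            return False  # the page really is too full
        if len(data) > self.largest_gap():
            self.compact()  # enough space in total, but not contiguous
        cursor = 0
        for offset, existing in sorted(self.records):
            if offset - cursor >= len(data):
                break  # first gap that is big enough
            cursor = offset + len(existing)
        self.records.append((cursor, data))
        return True


page = Page(100)
page.records = [(0, b"a" * 30), (40, b"b" * 30), (80, b"c" * 10)]
stored = page.store(b"d" * 25)  # 30 bytes free, but in three 10-byte gaps
```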

> Splitting a page is more complex,

Because of compression, records increase in length all the time. If a
record increases in length so that it no longer fits on its page, it is
fragmented and the tail is stuck on a different page.
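
Roughly, in Python, and only as a sketch: the 2-bytes-per-run size estimate
and both function names are assumptions, and real fragments also carry
header information that is ignored here. An update that leaves the declared
field size unchanged can still make the stored record longer, and whatever
no longer fits becomes the tail.

```python
from itertools import groupby


def compressed_size(data: bytes) -> int:
    """Toy estimate: every run of identical bytes costs 2 bytes on disk."""
    return 2 * sum(1 for _ in groupby(data))


def split_for_page(stored_size: int, space_on_page: int) -> tuple:
    """Return (bytes kept on the home page, bytes moved to another page)."""
    head = min(stored_size, space_on_page)
    return head, stored_size - head


old_record = b"ab" + b" " * 30          # 32 bytes, mostly padding
new_record = b"abcdefghij" + b" " * 22  # still 32 bytes, less padding

old_size = compressed_size(old_record)
new_size = compressed_size(new_record)
head, tail = split_for_page(new_size, space_on_page=10)
print(old_size, new_size, head, tail)
```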

> but it is similar to what you do when you append a record and
> you run out of space in the last active page.

Firebird doesn't exactly append records. When it needs to store a new
record, it checks the active pointer page for pages with space, so a new
record can go on a very early page if other data has been deleted from
that page.
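
As a sketch of that placement rule, in Python: the list of free byte counts
stands in for the pointer page and is an assumption (the real structure
records less detail than this), as are PAGE_SIZE and the function name.

```python
PAGE_SIZE = 4096  # assumption, for illustration only


def choose_page(free_bytes_by_page: list, record_size: int) -> int:
    """Return the first page with room; add a new page only if none has any."""
    for page_number, free in enumerate(free_bytes_by_page):
        if free >= record_size:
            return page_number
    free_bytes_by_page.append(PAGE_SIZE)
    return len(free_bytes_by_page) - 1


# Page 1 has room again because earlier data was deleted from it.
free_map = [0, 120, 0, 4000]
small = choose_page(free_map, 100)
large = choose_page(free_map, 500)
```

The small record lands on an early page rather than at the end, which is
why "append" is not quite the right word for what happens.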