| Subject | Re: [Firebird-Architect] Re: The Wolf on Firebird 3 |
|---|---|
| Author | Lester Caine |
| Post date | 2005-11-17T11:07:46Z |
Jim Starkey wrote:
>>AFAIK VARCHAR variable isn't truncated to the used length before
>>compression. Why?
>
> Because records are stored in fixed formats. The unused space in
> varchars is zapped to binary zeros so compression will eliminate it.

Which then links to ....
> 32-bit Unicode is intellectually defensible, but a terrible waste of
> storage. Why waste 4 bytes when probably 95% of the characters stored
> in Firebird world wide can be represented in a single byte?

Which storage are we talking about here?
And I am probably answering my own question ;)
If we work with 32-bit characters internally, then the 'binary zeros'
are stripped when storing to disk, so we are not wasting on-disk
storage? And if we pack the 32-bit characters into a suitable on-wire
format we will not be wasting space there either. The wire format could
be UTF-8, with the client converting to a local character set, or the
local character set could be sent directly - but if the 'to disk'
compression is good, why not use it on the wire as well and reuse the
code?
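As a rough illustration of the trade-off being discussed (a Python sketch, with zlib standing in for the engine's record compression - Firebird's actual scheme is a simple run-length encoding, and the field width here is invented):

```python
import zlib

# Hypothetical fixed-width CHAR(20) column stored as UTF-32LE (4 bytes
# per character); short values are padded out with binary zeros.
value = "hello"
fixed = value.encode("utf-32-le").ljust(20 * 4, b"\x00")

print(len(fixed))                  # 80 bytes in the fixed record format
print(len(zlib.compress(fixed)))   # the zero padding compresses away
print(len(value.encode("utf-8")))  # 5 bytes in a UTF-8 wire format
```

So a wide internal representation need not cost much on disk once the padding is compressed, while UTF-8 keeps the wire compact.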
I am probably looking at a special-case situation, since the data
contained in the Foundation Archive covers all languages (or that is
the intention), and there may be a better way of handling things, such
as producing 'translated' fields against which sorting can be managed.
But the 'simple' view of a single 32-bit character field format
internal to the engine sounds quite logical to me. Perhaps the problem
is really one of managing collation, and the storage required to hold
the various case, collation and mapping tables/algorithms needed to do
that conversion across multiple settings?
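For what it's worth, the gap between raw code-point order and language-aware order is easy to demonstrate; in the Python sketch below, accent-stripping via Unicode decomposition stands in for a real per-language collation table (the word list is illustrative):

```python
import unicodedata

words = ["Zebra", "Äpfel", "apple", "Österreich"]

# Raw code-point order puts the accented words after 'Zebra':
print(sorted(words))

# A crude collation key: decompose (NFD), drop combining marks,
# case-fold. This stands in for a real collation/mapping table; actual
# rules are language-specific (German phone-book order treats 'Ä' as
# 'Ae', for example), which is why the engine needs those tables at all.
def collation_key(s: str) -> str:
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
    return stripped.casefold()

print(sorted(words, key=collation_key))
```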
The methods used to store multilingual data in 8-bit character fields,
alongside a field recording the character set, look attractive - until
you try to list data that bridges two character sets :( I suspect this
is the real area that needs covering? Using a database in any single
language is not a problem; switching between several is the real
challenge.
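The mixed-character-set problem can be made concrete with a small sketch (Python; the byte values and the per-row charset tag are illustrative, not Firebird's actual storage format):

```python
# Two values stored in different 8-bit character sets, each tagged with
# its charset, as in the scheme described above:
rows = [(b"caf\xe9", "latin-1"), (b"\xec\xe8\xf0", "cp1251")]

# The raw bytes are ambiguous: 0xE9 is 'é' in ISO-8859-1 but 'й' in
# Windows-1251, so listing or sorting the raw bytes together is
# meaningless. A combined listing only works after decoding both rows
# to a common (Unicode) form:
decoded = [data.decode(charset) for data, charset in rows]
print(decoded)          # ['café', 'мир']
print(sorted(decoded))  # now comparable in one space
```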
--
Lester Caine
-----------------------------
L.S.Caine Electronic Services
Treasurer - Firebird Foundation Inc.