Subject RFC: Types from MsSql to IB and other more general differences.
Author Claudio Valderrama C.
Hello, originally this text was meant to be part of another prior posting,
but I decided to keep it separate. This is a comparison of types between
SqlMonster and InterPhoenix.

In MsSql, Transact-SQL has these base data types. I will put in the right
side of the equal sign the IB equivalent:

+binary(n) = char(n) character set octets. It has been reported that the IB
client library has a bug regarding octets; I'd like a confirmation, since I
never saw a definitive answer in the kinobi NGs.

+bit = this is a heritage from Sybase. The internal implementation uses a
byte or a word. This is a boolean field. People have used char and smallint
in IB for that. When I worked with the Sybase engine, even Sybase
client-side tools on Sun platforms (DB-Workbench, as far as I remember) had
problems dealing with this data type, so I discarded it from my list.
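A common IB idiom for this is a constrained smallint; the following is only
a sketch with names of my own invention:

```sql
/* Hypothetical emulation of MsSql's "bit" in IB: a smallint domain
   restricted to the values 0 and 1. */
CREATE DOMAIN D_BOOLEAN AS SMALLINT
  DEFAULT 0 NOT NULL
  CHECK (VALUE IN (0, 1));

CREATE TABLE flags (
  id     INT NOT NULL PRIMARY KEY,
  active D_BOOLEAN
);
```

The CHECK constraint rejects any value other than 0 or 1, which is as close
to a boolean as IB gets without a native type.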

+char(n) = char(n).

+cursor = Only available through trickery in procedures. Some days ago I
sent a full example (one that compiles with the appropriate field names in
place) under the thread "Need help with stored procedures" on the
interbase.public.kinobi NG, along with the equivalent procedure using
rdb$db_key, which is cleaner to me. The basic steps are:
FOR select field[s] from tbl
FOR UPDATE
INTO :variable[s]
AS CURSOR mycursor
DO BEGIN
{update tbl | delete from tbl}
where CURRENT OF mycursor;
END
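To make the skeleton concrete, here is a minimal sketch of a procedure
built on those steps; the table, field and cursor names are invented:

```sql
SET TERM ^ ;
/* Hypothetical example: raise all prices by 10% through a cursor. */
CREATE PROCEDURE raise_prices
AS
DECLARE VARIABLE v_price NUMERIC(18, 4);
BEGIN
  FOR SELECT price FROM products
      FOR UPDATE
      INTO :v_price
      AS CURSOR c
  DO BEGIN
    UPDATE products SET price = :v_price * 1.1
    WHERE CURRENT OF c;
  END
END ^
SET TERM ; ^
```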

+datetime = date in IB5 and IB6 dialect 1; timestamp in IB6, including
dialect 1 (yes, date and timestamp are synonyms in IB6-dialect 1).
BTW, Ann, I don't know why "timestamp" was used; it's misleading and I
would have preferred datetime, though maybe it's the SQL standard again...
"timestamp" carries the meaning of an automatic mark of 'NOW' or
current_timestamp, and this is not true: this field only keeps both date
and time, it doesn't get filled automatically. Sorry, but "datetime" in
SqlServer seems better to me. In SqlServer, the type uses 8 bytes and
records date and time data from January 1, 1753, to December 31, 9999, with
an accuracy of one three-hundredth of a second, or 3.33 milliseconds. The
base year is 1900. According to the manuals, IB can record from 1 Jan 100
a.d. to 29 Feb 32768 a.d. (the base year is 1858) with a precision of 1/10
millisecond. The timestamp (old "date") type in IB is 8 bytes long, so no
surprise it blows away SQL Server in both range and precision.
There seems to be another bug in the documentation... some time ago, when
going backwards in time, I reached year 1 of our era, yet LangRef on pages
19 and 20 keeps saying the lower limit is year 100. To prove the error, I
used
insert into dt values('1-Jan-0001')
where "dt" is a table with a timestamp field. So, the real range in IB is
1-Jan-0001 A.D. to 29-Feb-32768 A.D. However, another correction: Bill
showed that IB6 will only accept up to year 9999 when the date is passed as
a string, so the limit would be
insert into dt values('31-Dec-9999')
Using the raw API, though, Bill also showed that 1,469,902 and its negative
counterpart are the full range.

+decimal(p,s) = decimal(p,s).

+double precision, same as float(53) = double precision. See notes.

+float(n) = double precision or float, depending on "n". Float has no
configurable precision in IB and is not directly equivalent to this type.
See notes.

+image = BLOB sub_type N. I can't remember exactly; there are some proposed
sub_types for specific image fields in IB's header file, but don't worry:
sub_type is only for us, the engine doesn't care. You can use the default
sub_type zero or a negative, user-defined sub_type; it doesn't affect the
storage anyway.

+int = int, integer.

+money = numeric(18,4). IB doesn't allow a precision of 19 as Sql specifies
(19,4) for this field, but internally IB allows numbers using 19 digits of
precision. The reason not to let 19 be specified "officially" is that
1E18 < 2^63-1 < 9.9999999999999E18
because
2^63-1 =~ 9.223372036854775807e+18
so the IB developers wanted to be honest: people couldn't use the small
margin from 9.223E18 to 9.999E18 if they were allowed to define (19,X)
types... incidentally, you may assess for yourself the honesty of the MS
team. See notes at the bottom.
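In DDL, the mapping can be packaged as a domain; this is only a sketch and
the domain and table names are my own invention:

```sql
/* Hypothetical mapping of MsSql's money type to IB. */
CREATE DOMAIN D_MONEY AS NUMERIC(18, 4);

CREATE TABLE invoices (
  id    INT NOT NULL PRIMARY KEY,
  total D_MONEY
);
```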

+nchar(n) = nchar(n).
In LangRef, page 19, datatypes are explained. NCHAR doesn't appear. It's
not explained in any part of LangRef, although it's listed three or four
times. It would be nice to know exactly what it is. In my experience, it
always maps to ISO8859_1, whose rdb$bytes_per_character is 1. It supports a
COLLATE specification according to the collations available in IB for
ISO8859_1. In contrast, NCHAR in SqlServer is always UNICODE, so the low
level equivalence in IB is:
char character set UNICODE
but be warned that in IB, UNICODE has no collation other than the default.
Maybe it's better to map at high level simply to NCHAR even if the
underlying charset is different. Comments?
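Both mappings in DDL form, as a sketch (table names invented; in IB6 the
Unicode charset is called UNICODE_FSS):

```sql
/* High level mapping: NCHAR, which in practice is ISO8859_1 in IB
   and accepts the ISO8859_1 collations. */
CREATE TABLE customers_a (name NCHAR(30));

/* Low level mapping: the same charset family as MsSql's nchar, but
   only the default collation is available for UNICODE_FSS. */
CREATE TABLE customers_b (name CHAR(30) CHARACTER SET UNICODE_FSS);
```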

+ntext = blob sub_type text character set UNICODE. (Due to the previous
point, it might be charset ISO8859_1 instead, depending on what you really
want, but remember that IB doesn't allow COLLATE on BLOBs.)

+numeric(p,s) = numeric(p,s).

+nvarchar(n) = nchar varying(n). But the low level equivalence (see the two
points above) is:
varchar character set UNICODE.

+real, same as float(24) = float. See notes.

+smalldatetime = date in IB5, timestamp in IB6. See the discussion for
datetime above; IB doesn't provide an equivalent with less precision, it
uses full precision date and time fields. In MsSql, smalldatetime uses 4
bytes and tracks date and time data from January 1, 1900, through June 6,
2079, with accuracy to the minute. It was probably created to lower storage
requirements in MsSql.

+smallint = smallint.

+smallmoney = numeric(10,4). It will have a slightly bigger range. Money
and smallmoney are types that don't come predefined in IB. See notes below.

+text = memo, namely, blob sub_type text. BTW, "text" is a reserved word
only in the context of a sub_type declaration. The allowed sub_types are
TEXT, BLR, ACL, RANGES, SUMMARY, FORMAT, TRANSACTION_DESCRIPTION and
EXTERNAL_FILE_DESCRIPTION, and they don't interfere with declarations of
fields with the same name. For example, I tried the following sequence with
success:
create table sql(sql int);
commit;
insert into sql(sql) values(1);
commit;
/* Notice "text" appearing two times. */
alter table sql add text blob sub_type text;
commit;

+timestamp = No equivalence on IB. Don't be fooled, read MS' help file
first:
«The SQL Server timestamp data type has nothing to do with times or dates.
SQL Server timestamps are binary numbers that indicate the relative sequence
in which data modifications took place in a database. The timestamp data
type was originally implemented to support the SQL Server recovery
algorithms. [snip] In SQL Server version 7.0, @@DBTS is only incremented for
use in timestamp columns. If a table contains a timestamp column, every time
a row is modified by an INSERT, UPDATE, or DELETE statement, the timestamp
value in the row is set to the current @@DBTS value, and then @@DBTS is
incremented by one. [snip] Never use timestamp columns in keys, especially
primary keys, because the timestamp value changes every time the row is
modified.» In conclusion, it's for internal use, and Sql7 doesn't need it
for its recovery algorithms.
You can't compare it with a generator, unless you want to write triggers;
maybe it can be compared with a record version plus a transaction
identifier. Also, it's a legacy datatype, so let's forget about it.
A MsSql timestamp is defined briefly as a database-wide unique number. In
this sense, a logical equivalent may be IB's rdb$db_key, because it's
unique in a database, too. However, their objective and origin are
radically distinct, so let's leave the parallel right here.

+tinyint = No equivalence in IB. The char type in IB can't be used to
operate numerically as in C/C++, so use smallint in IB.

+varbinary(n) = varchar(n) character set octets.

+varchar(n) = varchar(n).

+uniqueidentifier = char(32) [to keep a visible representation] or char(16)
character set octets [to keep a binary representation]. This is the field
that keeps the infamous GUIDs. Transact-SQL provides the NEWID function
that creates a new GUID value... MS' random version of IB's gen_id. :-)
Well, seriously speaking, it's clear the objective is not the same for both
functions.
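The two storage options side by side, as a sketch (table and field names
are invented):

```sql
/* Hypothetical GUID storage in IB, one column per representation. */
CREATE TABLE guid_demo (
  g_text CHAR(32),                      /* hex digits, human readable */
  g_bin  CHAR(16) CHARACTER SET OCTETS  /* the raw 16 bytes */
);
```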


Notes:
======

A) This explanation from the MS help file seems really bizarre:
«
Currency or monetary data does not need to be enclosed in single quotation
marks. However, the monetary data value must be preceded by the appropriate
currency symbol. For example, to specify 100 English pounds, use £100.
»
So, MS ties not only a data type to currency, but also the monetary sign to
the value. Is this proposed in the SQL standard?


B) In MsSql, precision is enforced. I mean that with
numeric(10,1) you can insert up to 9 digits in the integer part and one
digit in the fractional part. If you attempt to insert more than 9 digits
in the integer part, you'll get an exception. Compare that with IB: you can
happily insert up to 18 digits in the integer part, no complaint. What does
the SQL standard say, please?
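A sketch of the behavior just described (table name invented):

```sql
CREATE TABLE p_test (n NUMERIC(10, 1));

/* MsSql raises an overflow error here, because the value has more
   than 9 digits in the integer part. IB stores it without complaint,
   since the underlying 64-bit storage accepts up to 18 digits. */
INSERT INTO p_test (n) VALUES (1234567890123456.7);
```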


C) Maximum length for character fields in MsSql is near to 8000 bytes in
Sql_7 and only 255 bytes in Sql_6. In Interbase, the limit is 32767 bytes.
However, indexes in IB are up to 252 bytes only; in MsSql, they are up to
900 bytes.
Also, long varchar fields are a performance nightmare on IB, better avoid
them, use BLOBs instead. I don't have experience measuring performance
problems with 8KB varchars in MsSql.
MsSql_7 is the first version that supports UNICODE. IB has a long history
of dealing with charsets. UNICODE in MsSql uses two bytes per character,
same as C/C++. For reasons unknown to me, UNICODE in IB requires three
bytes per character. The only collation for UNICODE_FSS in IB is the
default collation.
Remember that in both engines, the size that matters for indexes is the
physical one, that's the logical char/varchar length multiplied by the size
of the character set used, in bytes per character. However, as I had to
learn the hard way, in IB, ISO8859_1 can take up to 3 bytes per character
if a COLLATE clause other than the default is included.


D) High level operations on date and time. The MsSql call is presented and
then the IB alternative is shown. Unfortunately, you must either create a
procedure or rely on a UDF to mimic the MsSql behavior. For the MsSql
functions that follow, "datepart" can be the reserved words millisecond,
second, minute, hour, day, month, year, dayofweek, dayofyear and quarter.
Interbase doesn't have "millisecond", Ms' dayofyear is yearday in IB, Ms'
dayofweek is weekday and Ms' quarter is missing in IB.

+ getDate() = cast('now' as date) in IB5 and current_timestamp in IB6 in any
dialect.

+ DATEADD(datepart, number, date) adds "number" of "dateparts" to date
(that's really a datetime).
= In IB, no counterpart exists. For timestamp, adding an integer to a date
changes the date by that amount of days. To deal with the time part, one
must know that the value added is fractional, with the day as the unit:
adding N seconds means adding N/86400.0, adding N minutes means adding
N/1440.0 and adding N hours means adding N/24.0. Beware that due to the
change in arithmetic rules in dialect 3, one of the operands must be
floating point or the result would be zero: in dialect 3, division of two
integers follows the C rules, that's an integer result, and as such you
would be adding zero time. That's the reason I wrote the divisors above as
86400.0, 1440.0 and 24.0. There's no built-in way to add months, years or
quarters in a single operation in IB. See the FreeUDFLib for some UDFs that
deal with timestamp types in Interbase. Please note that this library
doesn't know how to deal with the new DATE-only and TIME-only types in
dialect 3. In these cases, unlike timestamp where you need to use the
integer and fractional parts depending on your needs, DATE-only is
incremented or decremented in days by adding an integer and TIME-only is
incremented or decremented in seconds by adding an integer.
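A sketch of that arithmetic on a timestamp column, with table and field
names invented; note the decimal divisors, a day being the unit:

```sql
/* Fractions of a day express the time part. Decimal divisors avoid
   integer division (which yields zero) in dialect 3. */
UPDATE events SET starts = starts + 3;             /* add 3 days    */
UPDATE events SET starts = starts + 90 / 86400.0;  /* add 90 seconds */
UPDATE events SET starts = starts + 30 / 1440.0;   /* add 30 minutes */
UPDATE events SET starts = starts + 6 / 24.0;      /* add 6 hours   */
```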

+ DATEDIFF(datepart, startdate, enddate) returns the number of date and time
boundaries crossed between two specified dates. In other words, it counts
how many "dateparts" are between two datetimes.
= In IB5, the difference of two datetimes produces a double precision value
(the integer part is the number of days and the fractional part is the
fraction of a day). In IB6, the difference between datetimes produces a
numeric(18,9) value (again, the fractional part is the fraction of a day),
the difference between two dates produces an integer number of days and the
difference between two times produces a numeric(9,4) value (where the
integer part is the number of seconds and the fractional part is the
fraction of a second, up to 1/10 millisecond of precision). There's no way
to return those differences in units other than the ones specified. For
example, if you want the difference between two dates in months or years
instead of days, you'll need a UDF. Again, see the FreeUDFLib for some
functions.
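Since an IB timestamp difference comes back in days, the smaller units are
just a multiplication away; a sketch with invented table and field names:

```sql
/* The subtraction yields days (with a day fraction); scale it to
   get the other units DATEDIFF would give. */
SELECT (ends - starts)          AS days_between,
       (ends - starts) * 24     AS hours_between,
       (ends - starts) * 1440   AS minutes_between,
       (ends - starts) * 86400  AS seconds_between
FROM jobs;
```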

+ DATENAME(datepart, date) returns a character string representing the
specified datepart of the specified date.
= No direct counterpart, but you can use something like
cast(extract(datepart from datevalue) as varchar(21))
to handle all cases. Remember that there are some dateparts not available
in IB and also that, due to the DATE-only and TIME-only data types in
dialect 3, some operations will be invalid in IB, for example requesting
the year of a TIME-only expression.

+ DATEPART(datepart, date) returns an integer representing the specified
datepart of the specified date.
= This is simply
extract(datepart from datevalue)
with the limitation that some "dateparts" have another name in IB or don't
exist.

The predefined system variables for date and time in Interbase are
described on my site, at the URL
http://members.tripod.com/cvalde/document/DateTimeFunctions.htm
for people who want to know all the reserved words that give the current
date and time.


E) We have already seen that char/varchar in MsSql_7 go almost up to 8Kb
and previously were limited to 255 bytes. Interbase supports up to 32K, but
don't ask me for good performance. There should be a way to deal with long
varchars, then. In MsSql, ntext contains unicode data, so the maximum is
2^30-1 characters, because each char is two bytes. In contrast, text can
have up to 2^31-1 characters, because each character is one byte, but
depending on the character string, the number of allowed characters may be
less. I understand this has to do with charsets, too. Image is raw storage
of up to 2^31-1 bytes. I will refer to these three field types as "TNI". In
MsSql,
CAST can operate on text and ntext but it can't operate with user-defined
data types. One nicety is the CONVERT function (non ANSI SQL) to format date
and time as a string. Obviously, the client receives the field as a string
and not as a date. I've used this feature in ASP applications, where I
needed to show dates in the Universal Format (year first), independent of
the client's regional settings. Also, substring() can be applied directly
to TNI
fields. There's a statement called WRITETEXT that can replace TNI fields'
contents in a non-logged way by default: it doesn't go to the transaction
log. To change only a portion of a TNI column, you can use UPDATETEXT,
where
an offset is optional to insert data in the middle of the stream or to
delete a portion of the stream. A text pointer value is needed for these
functions, that's obtained by calling the TEXTPTR function applied to the
desired field. It returns a binary(16). UPDATETEXT has an option to place
the operation on the transaction log. ADO applications can use the
AppendChunk method to specify long amounts of TNI data. ODBC applications
can use the data-at-execution form of SQLPutData to write new TNI values
(Jim is using internally classes BLOB and CLOB, but maybe he needs the same
external interface). OLE DB applications can use the ISequentialStream
interface to write new TNI values (hint for Jason). DB-Library applications
can use the dbwritetext function (so this is what the BDE uses, but only
when it does the right assignment). Also, these fields support the
DATALENGTH and READTEXT functions.
Let's see IB side: each BLOB can handle up to 2 GB in data, same as MsSql.
A BLOB can have a sub_type and a charset, so this is the way to distinguish
between text/ntext and raw binary BLOBs. The charset is enforced; the
sub_type is only a signal for the human reader and for applications that
rely on it as a convention. There's no CONVERT function in IB. CAST cannot
deal with BLOBs; you need Greg Deatz's FreeUDFLib. BLR has a substr()
function, but it doesn't work with UNICODE (according to Dave Schnepper)
and probably doesn't work with BLOBs either. There's the horrible substr()
implemented in ib_udf for char/varchar, and Greg has some functions to deal
with BLOBs in the FreeUDFLib (f_BlobLeft, f_BlobRight, f_BlobSize, etc.) to
cope with the lack of native support, so f_BlobAsPChar is the de-facto cast
from BLOBs to char/varchar fields and f_StrBlob is the opposite, but with a
254 character limitation. For some reason, f_BlobAsPChar kills my IB6, so
if you experience the same problem, use f_BlobLeft(blob,length) instead.
Please note that UDFs don't know anything about charsets or collations.
There's no "writetext" nor "updatetext" equivalent in stored procedures.
Maybe if Ann explains once and for all how to leverage the power of
segmented vs stream BLOBs, we can advance. :-) Apart
from declaring an array, I don't know how to declare directly a stream BLOB
in a procedure. There's no instruction to read/write segments or do
insertions in the middle of BLOBs from sprocs. Furthermore, officially,
BLOB parameters in stored procedures are not supported, even though I've
been using them since IB4.2. It's even possible to define a BLOB variable
in a stored procedure, but it probably stores only the BLOB_ID. Because IB
uses MGA, a BLOB operation is ALWAYS "logged": I mean the old version is
stored and the new version, too, so updating a 100 MB BLOB is a real
challenge for the engine... is there any optimization here? Is only the
changed segment stored in the new version, or is the whole BLOB repeated?
In the latter case, updating large BLOBs becomes impractical, as the db
would grow very fast.
Each BLOB has a BLOB_ID that's the one you see on a normal select with
isql.exe, but you can't manipulate a BLOB through a handle with a function
like MsSql's TEXTPTR (see above) from a sproc, you must use a client-side
program or embedded SQL. I still think BLR may have some hidden surprises,
but we can't surface them from the current PSQL (sproc & trigger language).
One nicety of MsSql is that you're allowed to insert literal data in TNI
fields without creating a parameterized SQL statement, filling the
parameters and then submitting, as is the current limitation with IB. If
you tolerate UDFs, then there's a workaround in IB without parameters:
insert into tbl(blobfield)
values(f_StrBlob('this is my literal string'))
so you can verify that it works if you really are using a script and can't
use parameters. The only requisite is that you must ensure this UDF is
available and defined in your target database.
In IB, BLOBs have charsets but not collations, so even if UPPER were
allowed to work on them, it wouldn't produce the desired effects with
characters other than the ones in the ['a','z'] range of the ASCII table.
(For example, accented characters won't be uppercased.) My only workaround
is to use one of the UDF calls highlighted above to get a string from a
BLOB and apply the correct charset and collation to that string to have it
uppercased properly. Since MsSql doesn't allow charsets/collations
field-wise, there's not much to deal with, the setting is global, but it
produces acceptable results in my experience. More below in point I).
Because in MsSql LOWER is also built-in, it works as expected in all cases.
Since LOWER() is a UDF in IB, it doesn't operate correctly on accented or
special characters, for example those with umlaut.


F) The types decimal and numeric are exact synonyms in MsSql. In IB,
numeric(p,s) and decimal(p,s) map to the same storage implementation
(depending on the precision and dialect, it can be smallint, int, double
precision in dialect 1 or ISC_INT64 in dialect 3), but the system field
rdb$field_sub_type is 1 for numeric and 2 for decimal. BTW, INT64 is not
defined as value 16 in rdb$types, for people who want to extract metadata
directly from the system tables.
I've read in the NGs that there are some subtle differences between decimal
and numeric in IB. I'd like this issue to be clarified, to know whether it
really matters or it's only an "urban legend" that can be forgotten. I
always use numeric, so I never worried about the subtleties of this
possible difference.
Also, the maximum limit for numeric/decimal storage in IB6 is 8 bytes for
precision=18, whereas the default limit in Sql_7 is 13 bytes of storage (so
you can have up to precision=28) and, if the "-p" startup parameter is
specified, Sql_7 can go up to an awesome precision=38 (17 bytes of
storage). Therefore, the lesson is to be careful, because values kept in
MsSql can overflow IB's current possibilities.


G) SqlServer still lacks true DATE and TIME types. In IB6 dialect 3, they
are named exactly DATE and TIME, and the earlier DATE becomes TIMESTAMP
(datetime for me). So, MsSql's datetime is TIMESTAMP in IB6 dialect 3 and
DATE in IB5 and earlier. Delphi understands this field type as "datetime".
Notice that IB6's true DATE and TIME in dialect 3 are available only to
dialect 3 capable clients, that's clients using the raw IB-API, IBX, IBO,
Jim's new ODBC driver or Jason's new and first OLE/DB driver. The ODBC
drivers by Merant and Visigenic along with the BDE DON'T understand dialect
3; they can connect to a dialect 3 db as dialect 1 clients, but they cannot
use DATE, TIME or fields defined as large exact numerics or ISC_INT64
(fields defined in dialect 3 as numeric(x,y) or decimal(x,y) where x>9).
But they can access any numeric/decimal field in the range allowed in
previous versions (up to precision 15) provided that such field was defined
by a dialect 1 client, in which case the internal storage is double
precision, as in IB5 and earlier versions.


H) There's a mess of float / double precision types between both engines, be
careful. In MsSql, float accepts an optional notation as float(n) to specify
the number of bits used to store the mantissa of the float number in
scientific notation, thus "n" becomes the precision. If n is 53, that's
MsSql's "double precision" (same as the "double precision" in IB) and if n
is 24, that's MsSql's "real" ("float" in IB).


I) More on charsets and collations, as promised in E). I will be doing a
hard copy/paste from the MS documentation to save some keystrokes:
MsSql:
«
Code Pages and Sort Orders
The physical storage of character strings in Microsoft® SQL Server™ is
controlled by the code page and sort order selected during installation.
»
There's no default code page at IB installation time that can be set up by
the user. The default is always NONE, meaning the engine doesn't attempt
any transliteration. The charset can be specified at database creation;
otherwise, all fields should specify the desired charset. A field's charset
can override the database's charset. Unfortunately, the collation must be
set on a field-by-field basis.
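A sketch of the two levels just mentioned; the file, table and collation
names are my own choice:

```sql
/* Database-wide default charset, set at creation time. */
CREATE DATABASE 'test.gdb' DEFAULT CHARACTER SET WIN1252;

/* A field can override the database default; collations, however,
   always go field by field. */
CREATE TABLE docs (
  title VARCHAR(50),                          /* inherits WIN1252 */
  body  VARCHAR(50) CHARACTER SET ISO8859_1 COLLATE ES_ES
);
```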

MsSql:
«
While the code page and sort order control the format of how data is stored
in each database, the specification of the code page and sort order are
global to the server. Every database attached to a specific SQL Server
installation uses the same code page and sort order. The code page and sort
order are specified during SQL Server Setup.
»
So, you can argue that IB gives you more flexibility, but I would like a
default that can be applied to new databases automatically or, at least, a
global COLLATE kept in rdb$database.

MsSql:
«
There are several code pages. The default code page for a server running SQL
Server is the 1252 - ISO character set, which contains the common characters
from most languages that originated in Western Europe. [snip] When you pick
a code page during the setup of SQL Server, the data in all character
columns in the database, and in all character variables and parameters in
Transact-SQL statements, is stored and interpreted using the bit patterns
from the indicated code page. [snip] If the server is running one code page
and clients are using another code page, then the clients may need to turn
on options in the SQL Server ODBC driver and OLE DB provider to support the
proper conversion of extended characters. In SQL Server version 7.0, the SQL
Server ODBC driver, and the OLE DB Provider for SQL Server handle this
conversion automatically. [snip] A new feature in SQL Server 7.0 is support
for Unicode data types, which eliminates the problem in converting
characters. Unicode stores character data using two bytes for each character
rather than one byte.
»
What is called here the "1252 - ISO character set" should be none other
than WIN1252 in Interbase. I've tried it and it seems to work well,
provided that you connect to the database with the same charset. Since IB
is charset/collate per field, you can see that several fields in the system
tables ALWAYS have their own charset and collation regardless of the
db-wide charset. For example, several fields use UNICODE_FSS, whereas
rdb$message -the text of a user-defined exception- is charset NONE.
One thing that I don't have clear is the level of translation the client
side of IB can achieve. It seems to be non-existent. If you read the phrase
above, both the ODBC and OLE/DB (ADO) drivers are able to perform
translations to the client charset. In Interbase, connecting to a db with a
charset other than the one used in the user tables' fields is a direct
invitation to receive tons of "transliteration error" messages. I declare
in public my lack of in-depth knowledge in this area, when dealing with
multiple charsets in the same database at the same time.

MsSql:
«
It is easier to manage international databases if you:
Install SQL Server with the default code page.
Implement all char, varchar, and text columns using their Unicode
equivalents; nchar, nvarchar, and ntext.
The fact that Unicode data needs twice as much storage space is offset by
eliminating the need to convert extended characters between code pages.
[snip] SQL Server 7.0 stores all textual system catalog data in columns
having Unicode data types. The names of database objects such as tables,
views, and stored procedures are stored in Unicode columns. This allows
applications to be developed using only Unicode, which avoids all issues
with code page conversions.
»
MS has always had problems with those issues. Why didn't you notice any?
Probably because WIN1252 is fine for almost all of us, but it doesn't cover
100% of the needs of the languages of Central Europe, for example, as I
understand. Then a translation must be done in the ODBC/ADO driver. This
slows the retrieval process a bit and also forces the presence of a
multitude of settings. As I wrote, you are tied to those settings when you
install MsSql, and if you want another charset and collation, fine, "just
reinstall MsSql", a problem that IB developers have never faced. Regarding
IB, the names of db objects are kept in UNICODE fields, too. MS found the
solution by asking developers to use UNICODE at the cost of always spending
two bytes per character. On the IB side, UNICODE has only one collation and
it doesn't do the correct work for my needs, so the one to use is
ISO8859_1. That's the reason why, in the mapping of fields, I suggested
that Ms' NCHAR be mapped to IB's NCHAR: in IB's NCHAR, ISO8859_1 does the
right thing if you add the correct collate. Unfortunately, in some cases
the eager ISO8859_1 is able to eat from 1 to 6 bytes per character with
collation DE_DE, though the average is 3 bytes, according to Dave
Schnepper.

MsSql:
«
Sort Order
The sort order is another option specified during the Setup program. The
sort order specifies the rules used by SQL Server to collate, compare, and
present character data. It also specifies whether SQL Server is
case-sensitive.
»
MS was clearer here. "Collate" doesn't tell much to a newcomer, but SORT
ORDER makes any person suspect at least what's being offered. The same
function is done by collations in IB: they control, inside a given charset,
the sort order, the comparison rules and the transliteration issues.
Therefore, ORDER BY, WHERE, GROUP BY, HAVING and UPPER are affected. In
SqlServer, case-sensitive/insensitive is an option. In Interbase as we know
it, it's only case sensitive, so people migrating from Access and MsSql
will be shocked. Wait, this is the past. The future is brilliant: Dave
Schnepper, the former ISC expert in charsets and collations, is committed
to his drivers and we have a beta set of INSENSITIVE charsets and
collations for Interbase.
Contrary to MsSql, in IB the collation is always a PER-FIELD setting.
Tiresome IMHO; I would prefer some way to have a database-wide collation,
same as happens with the charset.

MsSql:
«
SQL Server 7.0 uses two sort orders, one for the character code page and one
for Unicode data. Both sort orders are specified during setup.
»
No surprise here. Interbase uses a set of collations for each charset it
supports. And of course, even though it's more work to define the fields,
there's no need to reinstall the server to change those settings.

MsSql:
«
The sort orders are different for each SQL Server code page. For example,
the sort orders available with code page 1252 are:
Dictionary order, case-insensitive (default)
Binary order
Dictionary order, case-sensitive
Dictionary order, case-insensitive, uppercase preference
Dictionary order, case-insensitive, accent-insensitive
Danish/Norwegian dictionary order, case-insensitive, uppercase preference
Icelandic dictionary order, case-insensitive, uppercase preference
Swedish/Finnish (standard) dictionary order, case-insensitive, uppercase
preference
Swedish/Finnish (phonetic) dictionary order, case-insensitive, uppercase
preference
»
Compared to IB, WIN1252 is charset 53, uses 1 byte minimum and 1 byte
maximum, the default collation is called WIN1252 and the other collations
are:
PXW_INTL
PXW_INTL850
PXW_NORDAN4
PXW_SPAN
PXW_SWEDFIN
The PXW_ prefix means that they follow the rules of the Paradox for Windows
drivers. Of course, the self-explanatory names in MsSql make the choice
easier.


J) Even more on charsets and collations.
Please remember that IB needs the right charset to store strings or it may
throw a transliteration error. The charset can be specified for the whole
database at creation time. A field without a charset specification is
created with the default one for the whole database. To discover the
default for a db, you can inspect the system table rdb$database, which
ALWAYS has only one record: if the field rdb$character_set_name is NULL,
the charset is NONE, the default, and all fields will get it unless it's
overridden on a field-by-field basis. If there's a string here, this is the
charset applied to fields by default. You must connect with the same
charset as the database uses to be able to work without transliteration
errors. Even if you don't see an awful message, you may get the wrong
characters, as this example shows in IB6:
select cast((_WIN1252 'paó') as char(10) character set unicode_fss)
from rdb$database
As you can test for yourself, the accented "o" is converted to another
character when displayed and no error is returned. If you want to see a
typical transliteration error, you can try:
select cast((_WIN1252 'paó') as char(10) character set win1251)
from rdb$database
That's a painful limitation in some cases, unless your program connects,
retrieves the default charset and reconnects with it. However, since each
field at creation time can override the database default, the developer has
this freedom at the cost of a responsibility: a field that specifies a
charset conflicting with the database-wide one means that a client using the
db's charset as a reference could run into problems.
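
The default-charset check described earlier amounts to a one-line query
(this is just the inspection the explanation implies, nothing new):
select rdb$character_set_name from rdb$database;
If it returns NULL, the database default is NONE; otherwise the returned name
is the charset applied by default to fields created without one.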
In comparison, in MsSql the client driver can do the translation to the
right charset when returning results to the calling application. However,
while IB can use database-wide and field-specific charsets, MsSql uses a
server-wide charset, and changing it requires reinstalling the server and
rebuilding the databases, not a job for the faint of heart, of course. By
default, IB uses charset NONE, meaning characters are stored as they come
and returned without any transformation. Although I know that I should use
WIN1252 or ISO8859_1 (Latin_1) to deal with my native language (Spanish), I
usually simply let NONE rule. One problem is that some characters land in the
wrong position when sorting, but in practice such characters (mainly ñ) come
up in the sorted field only a few times, so the disorder is not so severe.
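
If the misplacement of ñ does matter, a possible workaround is asking for the
collation in the query itself; a sketch, assuming the field was created with
charset WIN1252 (a collation only applies to its own charset) and with
hypothetical names "clients" and "name":
select name from clients
order by name collate PXW_SPAN;
Fields stored with charset NONE cannot use this, which is part of the price of
letting NONE rule.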
In IB (and probably in every engine), every charset carries a default
collation. Some charsets in IB have only one collation. All of the native
collations are case sensitive, meaning, for example, that the whole uppercase
alphabet may appear before all of the lowercase alphabet. To learn what's
going on with the cryptic collation names, I asked Dave Schnepper again and
this is the lowdown:
«
Oh, one more thing Claudio mentioned was the obscurity of names. InterBase
follows an old (and seemingly unpopular) method of naming locales. XX_YY
where YY is a country code (defined by an ISO spec I can't remember) and XX
is a language code (different ISO spec). (This is all documented in the
Collation Driver Construction Kit). Thus, FR_FR is French used in France,
and FR_CA is French used in Canada. GB_GB must be something like Chinese in
People's republic.
»

To finish this journey, I want to point out that on June 4, I reported the
following in interbase.public.kinobi, just because I was playing with
charsets and collations:
«
LangRef says in page 281 that charset win1250 supports collate pxw_slo, but
when I try to define a domain with this combination, I get
collate pxw_slo not defined.
and the same if I try to apply this collation to an "order by" clause.
Where is the bug, in the docs or in the engine?
»
and I'm still waiting for an answer. ;-)


K) Subtleties and mysteries of IB:
- There's no rdb$field_sub_type entry for numeric/decimal defined in
rdb$types. They are only documented in LangRef as 1 and 2, respectively.
- Also, type 16, ISC_INT64, the underlying implementation of numeric(p,s)
and decimal(p,s) with p>9, doesn't appear in rdb$types. Is this a bug in
the system tables? All other field types are there!
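
Anyone can verify this directly against the system tables; this query uses
only standard system-table names and lists the registered field types:
select rdb$type, rdb$type_name from rdb$types
where rdb$field_name = 'RDB$FIELD_TYPE'
order by rdb$type;
If the report above is right, type 16 should be missing from the output even
though the engine uses it internally for exact numerics with p>9.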

There are some other internal types:
- QUAD (for the legacy timestamp?).
- BOOLEAN (yes, BOOLEAN, but reserved only for a UDF's return parameter type;
what about making it visible in DSQL and making several people happy? It
seems totally unused and maybe not implemented at all).
- CSTRING (for UDFs).
- BLOB_ID is the type of the BLOB identifier. I don't know if such an
identifier is unique per table or per database; does anyone care to explain?
- The sub_types of BLOBs, which are reserved words only in the context of a
blob field definition and don't clash with any field name: TXT, BLR, ACL,
RANGES, SUMMARY, FORMAT, TRANSACTION_DESCRIPTION and
EXTERNAL_FILE_DESCRIPTION. QLI can interpret them in human-readable format.
- EXPRESSION_INDEX, for some unfinished implementation of indexes on
expressions like field1||field2 and other more complicated examples.

L) Other issues in MsSql:
- A field of integral type can be declared as "Identity". There's no such
type in IB. The only way to emulate it is by writing a trigger that
increments a generator by means of the atomic call gen_id(); this value is
assigned to an integer field in dialect 1 or to a numeric(18,0) field in
dialect 3. An identity field, if used as the PK field, makes the client
application lose track of the record, because it doesn't know what value was
assigned. In IB, the ancient trick is:
create trigger tbl_bef_ins
for tbl before insert as
begin
if (new.ident_field is null)
then new.ident_field = gen_id(gen_name, 1);
end
and then the application can use
select gen_id(gen_name, 1) from rdb$database
so the trigger fills in the value only when the application doesn't provide
it. Also, if such an identity column is not used in the client and is not the
PK field, then it can be excluded from the INSERT statement and the server
will provide the value from the generator automatically. The call to
gen_id() needs a previously defined generator's name and an integral
increment (integer in dialect 1 or up to numeric(18,0) in dialect 3) that
can be either negative or positive. I agree that writing 45 triggers and
defining 45 generators to mimic 45 identity fields is tiresome, but there's
no shortcut in IB yet. Also, don't cross the 127-generator barrier in a
single database or you will face corruption.
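
Putting the pieces together, the full emulation (gen_name, tbl and
ident_field remain hypothetical names) needs the generator declared before
the trigger shown above can compile:
create generator gen_name;
set generator gen_name to 0;
and, on the client side, fetching the value before the insert so the
application doesn't lose track of the new PK:
select gen_id(gen_name, 1) from rdb$database;
insert into tbl (ident_field) values (:new_id);
Since gen_id() is atomic and runs outside transaction control, two clients
can never receive the same value.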

- In MsSql_7, as I understand, it's possible to call triggers directly. While
I don't know what happens in that case with a trigger being called without a
table, the arguments for and against this idea might be endless.

- Due to a change in MsSql_7, I want someone to confirm or deny that it's no
longer possible to hide the source code of procedures (and probably
triggers, too), because they are always kept in source form and compiled the
first time they are invoked. In InterBase, you can wipe out the
rdb$procedure_source and rdb$trigger_source fields in tables rdb$procedures
and rdb$triggers respectively and the engine will keep working, same as in
Sybase. You only have to keep the source for yourself, because if you attempt
to edit the trigger, you won't be able to retrieve the source from the db, as
expected. You can inspect the BLR (Binary Language Representation) generated
for procedures and triggers (and it should be possible for check constraints,
too, by putting an option in the tool) by using Jason Wharton's IB_WISQL.
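
For the record, the wipe-out mentioned above amounts to a direct update on
the system tables (keep a copy of the source first; MYPROC is a hypothetical
name):
update rdb$procedures
set rdb$procedure_source = null
where rdb$procedure_name = 'MYPROC';
commit;
and the analogous update on rdb$triggers.rdb$trigger_source. The compiled
BLR stays in place, so the object keeps running.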

- MsSql offers two levels of security: internal and OS-based. In the first
case, the server needs to define its own users. In the second case, the
server maps users from an NT domain. Since MS has MsSql_2k almost ready, I
assume it will recognize the new ADS (Active Directory Service) schema that
replaces NT4's domains. In InterBase, the security is on the server; there's
no integration with the operating system. Only ADMINISTRATOR and ROOT can log
in without user/pw and they are mapped to SYSDBA. However, isc4.gdb, the
centralized security db in IB, has a "host_info" table. I understand it can
map UNIX hosts and, also, the USERS table has UID and GID columns to map UNIX
users and groups, but I have never tried those facilities.

- MsSql has some predefined system roles that can be granted to normal
users. In IB, the only super user is SYSDBA and no other user can perform
all the tasks SYSDBA does. Also, MsSql keeps a centralized db of users (when
in native mode) but users must be mapped by the administrator in each
database. In IB, a user that's recognized by isc4.gdb for a running IB server
can attempt connections to all databases that exist on the same machine where
the IBServer is running. There are no predefined roles nor auxiliary super
users in InterBase.
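
To be fair, IB does support user-defined SQL roles (even if none come
predefined); a sketch, with hypothetical names:
create role accounting;
grant select, insert on invoices to accounting;
grant accounting to joe;
The catch is that the client must request the role explicitly at connection
time for the grants to take effect, unlike MsSql's system roles.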

M) Disclaimer: if you find that at some point I wrote about BLOBs and their
collations, I meant to write BLOBs and their charsets, because BLOBs can
only have charsets, not collations.
Disclaimer 2: don't get mad trying to look for the database aliases in the
server. IB uses the physical path from the root of the filesystem in the
host machine to identify databases.
Disclaimer 3: don't try to define referential constraints between databases,
because they don't exist in IB. Also, if you think that referential
constraints and triggers are exactly equivalent for checking consistency in
the db, please read first
http://members.tripod.com/cvalde/document/declaRefIntegVsTrig.htm
to understand that some boundary conditions can defeat your checks made in
triggers, so declarative integrity is the right way.
Disclaimer 4: don't get mad creating indices on IB; you will only hurt
performance if you use more indices than the strictly necessary ones.
Indexes are mainly for satisfying WHERE conditions against fields with good
selectivity (several different values), not for SORTING.
Disclaimer 5: IB creates indexes automagically on PK, Unique and FK
declarations; see
http://members.tripod.com/cvalde/features/f4/AutoCreation.htm
to avoid creating duplicate indexes that reduce performance. Remember, too,
that FK declarations require exclusive access to the database, namely,
only one connection.
Disclaimer 6: greetings to our Mom. ;-)
Disclaimer 7: I'm going to bed, just 4 AM... again!

C.
---------
Claudio Valderrama C.
Ingeniero en Informática - Consultor independiente
http://members.tripod.com/cvalde