Subject Converting of an ISO8859_1 database to UTF8
Author patrick_marten
Hello,

I'm working on a new release of my application, in which some bigger changes are supposed to happen. I need some assistance please :)

Sorry, it's a bit long, but it's about a sensitive matter, so I wanted to provide as many details as possible.


1. The current version
-----------------------
1.1 It's using Firebird 2.1.3.18185 to access a database with ISO8859_1 character set.
1.2 Several tables of the database have fields declared as follows:
FIELDX BLOB SUB_TYPE 1 SEGMENT SIZE 80 CHARACTER SET ISO8859_5,
FIELDY VARCHAR( 100) CHARACTER SET ISO8859_5,
Those fields weren't used so far, because IBO didn't support unicode when this version was created, but they do play a role below, so I'm mentioning them.
1.3 Several tables contain BLOB SUB_TYPE 1 fields, which store RTF-text in them


2. The new version
-------------------
2.1 Although it's not neccessary, I was planning to use the latest Firebird version - currently 2.5.1.26351
2.2 Since IBO supports unicode now, my application is supposed to support it as well, so that the "new" database will have UTF8 as default character set
2.3 Fields mentioned in 1.2 will no longer have to have a special character set - everything UTF8
2.4 Fields mentioned in 1.3 have to become BLOB SUB_TYPE 0 fields, as the content will be stored in a different way in the new version. The content must not get lost of course
2.5 A lot of structure changes will come with the new version too (new tables, fields etc.)


While all the changes are not a problem for new customers, they pretty much are a big issue for existing customers, as I will have to "convert" their existing databases, once they install the new version. So the requirements from point 2 have to happen somehow on their machines.

3. My ideas / concerns so far
------------------------------
3.1 Using the script "convert2utf8.sql" from IBExpert site (http://www.ibexpert.com/download/character_set_conversion/)
I've tried it, but am not very happy with the result:
3.1.1 Problem #1 is: the fields mentioned in 1.2 remain as they are, i.e. ISO8859_5 in that case. The script probably has to be executed several times - each time with a different source_characterset, i.e. source_characterset=ISO8859_1, source_characterset=ISO8859_5, ...
3.1.2 Problem #2 is: although all other columns seem to have UTF8 as character set, the database itself still seems to have ISO8859_1 as character set (showed in IBExpert)
3.1.3 Problem #3 is: after my attempt to convert an existing database it seems to miss 5 tables, 6 views, 58 indicies and to have 3 triggers more than in the source database...
3.1.4 Structure changes (2.4 and 2.5) would have to happen additionally afterwards
All in one I'm not very convinced by this option...

3.2 Using a component like TIB_DataPump
I don't have any experience with it and don't know, if that's an option to accomplish what I need. The thoughts so far:
3.2.1 I probably would have to make an UTF8 database first, which has the same structure as the source database - just different character sets
3.2.2 One would be accessed by a connection with ISO8859_1 as character set, one by a connection with UTF8 as character set
3.2.3 However the component would pump the data, it would get converted to UTF8 that way (I hope)
3.2.4 Structure changes (2.4 and 2.5) would have to happen additionally afterwards

3.3 Doing everything manually
Means probably the most work, but appears to be the safest way at the moment.
3.3.1 I could start with the final UTF8 database, i.e. with all the structure changes etc. included
3.3.2 The source database would be accessed by a connection with ISO8859_1 as character set, the UTF8 one by a connection with UTF8 as character set
3.3.3 Going through all the tables and "pump" the data manually into the new database, whereby the conversion to UTF8 would happen automatically (I hope). Good thing would be, that I could do the steps like 2.4 at the same time

3.4 Some other / better way(s) to accomplish what I need, you could think of

The new version will be installed into a new folder, so that there will be files of two different firebird versions on the machine, until the old version gets uninstalled. So I also would need to know when and how the jump to Firebird 2.5.1 should be done, especially considering the note I found in one of the PDF files: "IMPORTANT: Before upgrading please pay attention to the files in the /misc/upgrade/metadata directory of your Firebird 2.1 installation as an essential step for preparing your databases for use under Firebird 2.1. If your databases contain metadata fields (like object names, comments/descriptions, computed field definitions, PSQL, view or constraint sources) storing non-ASCII data then the backup/restore upgrade cycle may not be enough to make them usable." This one sounds as if it's not relevant for me, as my databases are already used under Firebird 2.1, but I also believe having read, that about the same applies to blob fields and a change from Firebird 2.1 to Firebird 2.5.1. Maybe I mix up things here though...

Regarding 3.1.4 and 3.2.4: I've found this example on how to change the blob type:
UPDATE RDB$FIELDS SET RDB$FIELD_SUB_TYPE = 0
WHERE RDB$FIELD_NAME=(SELECT RDB$FIELD_SOURCE FROM RDB$RELATION_FIELDS
WHERE RDB$FIELD_NAME='SOME_TEXT_BLOB_FIELD' AND RDB$RELATION_NAME='SOME_TABLE');
Seems to work without the content getting lost. Sure, it's in the wrong format that way, but it seems to be possible to obtaint the content and then to save it in the new format. But depending on the choice of the way to go, this could be irrelevant.


All in one I would like to ask you for some assistance and an as detailed instruction as possible on how to proceed to accomplish the requirements from point 2, as this is a sensitive step and must not make any mistakes :)


Many thanks and kind regards,
Patrick