Subject | Creating GB18030 character set and collation |
---|---|
Author | dhay@lexmark.com |
Post date | 2004-01-15T19:51:56Z |
Hi Peter,
Okay, now we've discovered that GB18030 supersedes GB2312!!
So, I now have the task of coming up with a GB18030 character set and
collation. Good news, though, is that 18030 is a superset of 2312.
I was wondering if you would mind doing a bit of hand-holding as I get
started on this?! If we are successful, I'll gladly give the character set
and collation to Firebird (am sure my boss won't mind, but haven't asked
him!). I really am out of my depth, and trying to get my head around all
this character set stuff. Sounds like I haven't picked an easy one to get
started on either!! Thankfully we have a Chinese guy in the office who
should be able to help me some.
I took a quick look at the GB_2312 stuff in the source tree (my C++ is
pretty rusty, so forgive me if this is all wrong!). It appears that
everything is just converted to and from Unicode through a mapping array?
Korean seems to be different, in that it has the LCKSC_string_to_key
function?
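To make the "mapping array" idea concrete, here is a minimal sketch of a table-driven lookup from a two-byte GB2312 code to Unicode. The names MapEntry, kGbToUnicode and gb_to_unicode are made up for illustration, and the sorted-array layout is an assumption, not the layout the Firebird source actually uses. Only two real GB2312 entries are included; a full table has thousands of rows.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

// Hypothetical table row: a two-byte GB2312 code and its Unicode code point.
struct MapEntry {
    std::uint16_t gb;
    std::uint16_t ucs;
};

// Must stay sorted by .gb so that binary search works.
static const MapEntry kGbToUnicode[] = {
    {0xB0A1, 0x554A},  // 啊
    {0xB0A2, 0x963F},  // 阿
};

// Returns the Unicode code point for a two-byte code, or U+FFFD
// (replacement character) if the code is not in the table.
std::uint16_t gb_to_unicode(std::uint16_t gb)
{
    const MapEntry* first = std::begin(kGbToUnicode);
    const MapEntry* last = std::end(kGbToUnicode);
    const MapEntry* it = std::lower_bound(
        first, last, gb,
        [](const MapEntry& e, std::uint16_t v) { return e.gb < v; });
    return (it != last && it->gb == gb) ? it->ucs : 0xFFFD;
}
```

The reverse direction (Unicode to GB) would need a second table sorted by the Unicode value.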
So, to add a new character set, do I need new versions of
intl/lc_gb2312.cpp
intl/cv_gb2312.h
intl/cv_gb2312.cpp
intl/charsets/cs_gb2312.h
and then, to add a collation, a file like those found in intl/collations?
There also seems to be an issue of 3 and 4 byte chars! Not sure if we can
get away with ignoring them. If not, is there an easy way to handle them?
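As far as I can tell from the summaries linked below, GB18030 actually uses one-, two- and four-byte forms (there is no three-byte form), and the form can be recognised from the first two bytes. A minimal sketch of that check follows; gb18030_seq_len is a hypothetical helper name, not something from the Firebird tree, and it only checks byte ranges, not whether the code is assigned.

```cpp
#include <cstddef>
#include <cstdint>

// Returns the length in bytes (1, 2 or 4) of the GB18030 sequence starting
// at p, or 0 if the bytes are not a well-formed sequence. `avail` is the
// number of bytes that may be read from p.
std::size_t gb18030_seq_len(const std::uint8_t* p, std::size_t avail)
{
    if (avail == 0) return 0;
    const std::uint8_t b1 = p[0];
    if (b1 <= 0x7F) return 1;              // single byte (ASCII range)
    if (b1 < 0x81 || b1 > 0xFE) return 0;  // 0x80 and 0xFF are not lead bytes
    if (avail < 2) return 0;
    const std::uint8_t b2 = p[1];
    if ((b2 >= 0x40 && b2 <= 0x7E) || (b2 >= 0x80 && b2 <= 0xFE))
        return 2;                          // two-byte form (covers GB2312)
    if (b2 >= 0x30 && b2 <= 0x39) {        // digit as 2nd byte: four-byte form
        if (avail < 4) return 0;
        const bool ok = p[2] >= 0x81 && p[2] <= 0xFE &&
                        p[3] >= 0x30 && p[3] <= 0x39;
        return ok ? 4 : 0;
    }
    return 0;
}
```

So the four-byte forms cannot simply be skipped as if they were pairs of two-byte characters: the second byte has to be inspected to find the character boundary.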
These articles seem to be pretty exhaustive (not that I understand it
all!):
ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf
http://oss.software.ibm.com/icu/docs/papers/gb18030.html (has some
algorithms in it)
and then the mapping to unicode in xml format can be found at:
http://oss.software.ibm.com/cvs/icu//charset/data/xml/gb-18030-2000.xml?rev=1.4&content-type=text/x-cvsweb-markup
Any advice you can offer and pointing in the right direction would be
great!!
Cheers,
David
"peter_jacobi.rm" <peter_jacobi@gmx.net>
11/21/2003 05:26 PM
Please respond to firebird-support
To: firebird-support@yahoogroups.com
cc:
Subject: [firebird-support] Chinese/Korean (was Re: Non - printable characters in Stored Procedures)
Hi David,
--- In firebird-support@yahoogroups.com, dhay@l... wrote:
> > so I can only hope, that GB_2312 in its binary sort order
> > is of some use.
> Dare I ask what if it's not?!! How easy would it be to create/add a
> collation? We have some Chinese colleagues here...
You can't add a new collation through data tables alone (you
could vote for this enhancement request, if we
had voting in the SourceForge bug tracker...).
So you can add a new collation either by extending
fbintl.dll or by adding an fbintl2.dll.
There is sample fbintl2.dll source code at
http://groups.yahoo.com/group/Firebird-Architect/files/charsets_and_collations/
You are also welcome to have a look at the source tree:
http://cvs.sourceforge.net/viewcvs.py/firebird/
The relevant stuff is in firebird2/src/intl.
Looking at the source for Korean:
http://cvs.sourceforge.net/viewcvs.py/*checkout*/firebird/firebird2/src/intl/lc_ksc.cpp?content-type=text%2Fplain&rev=1.7
There you'll see the function LCKSC_string_to_key. This is where the
important stuff happens. The string is converted by
language-dependent rules into a byte string, which can then be compared
using normal binary comparison.
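The string-to-key idea can be sketched with a toy rule set. This is not the Korean logic in lc_ksc.cpp; toy_string_to_key and compare_keys are hypothetical names, and the rule (ASCII letters sort as a < A < b < B ..., everything else sorts before the letters) is only an assumption chosen to show the mechanism.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Toy string-to-key: each input byte becomes a two-byte weight so that
// plain binary comparison of keys gives the order a < A < b < B < ...
// Non-letters get class byte 0x00 and therefore sort before all letters.
std::string toy_string_to_key(const std::string& s)
{
    std::string key;
    key.reserve(s.size() * 2);
    for (unsigned char c : s) {
        if (c >= 'a' && c <= 'z') {
            key.push_back('\x01');
            key.push_back(static_cast<char>(2 * (c - 'a')));
        } else if (c >= 'A' && c <= 'Z') {
            key.push_back('\x01');
            key.push_back(static_cast<char>(2 * (c - 'A') + 1));
        } else {
            key.push_back('\x00');
            key.push_back(static_cast<char>(c));
        }
    }
    return key;
}

// Binary comparison of two keys: <0, 0 or >0, like memcmp, with the
// shorter key sorting first when one is a prefix of the other.
int compare_keys(const std::string& a, const std::string& b)
{
    const std::size_t n = std::min(a.size(), b.size());
    const int r = std::memcmp(a.data(), b.data(), n);
    if (r != 0) return r;
    return (a.size() > b.size()) - (a.size() < b.size());
}
```

With raw bytes "Banana" sorts before "apple" because 'B' (0x42) is less than 'a' (0x61); with the keys, "apple" comes first. The comparison itself stays a plain memcmp, and all language knowledge lives in the key function.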
> > I've got at least one other support question for
> > GB_2312, and after clarifying some basics, no further
> > complaints.
> Can I ask what?
Essentially, that Firebird doesn't do any automatic setup
based on the install locale, so the character sets must
be specified 'manually'. Which leads to your next
question:
> Do you have any advice on setting up a single application
> that will be used in multiple countries, i.e. the best way to
> change char sets depending on install
> locale?
Join the efforts to make Firebird 2.0 I18n more
user-friendly?
Seriously, you need a large table mapping charsets and
locales to the best matching Firebird charset and collation.
The best match found is used for table creation and as
connection charset.
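A minimal sketch of such a table, assuming POSIX-style locale names as keys. The struct FbCharset, the function pick_charset and the handful of entries are illustrative only; the charset and collation names should be checked against what your Firebird version actually lists in RDB$CHARACTER_SETS and RDB$COLLATIONS, and the NONE fallback is my assumption.

```cpp
#include <map>
#include <string>

// Hypothetical pair of Firebird names chosen for a locale.
struct FbCharset {
    std::string charset;
    std::string collation;
};

// Looks up the best match for a locale such as "zh_CN" or "zh_CN.GB2312".
// Anything after a '.' (the encoding part) is ignored. Unknown locales
// fall back to NONE/NONE.
FbCharset pick_charset(const std::string& locale)
{
    // Illustrative entries only; a real table would be much larger.
    static const std::map<std::string, FbCharset> table = {
        {"de_DE", {"ISO8859_1", "DE_DE"}},
        {"ja_JP", {"SJIS_0208", "SJIS_0208"}},
        {"ko_KR", {"KSC_5601", "KSC_DICTIONARY"}},
        {"zh_CN", {"GB_2312", "GB_2312"}},
        {"zh_TW", {"BIG_5", "BIG_5"}},
    };
    const std::string base = locale.substr(0, locale.find('.'));
    const auto it = table.find(base);
    if (it != table.end()) return it->second;
    return FbCharset{"NONE", "NONE"};
}
```

The installer would call pick_charset once with the install locale and use the result both in the table creation scripts and as the connection charset.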
Regards,
Peter Jacobi