Unicode in AR System


This topic discusses how AR System works with Unicode.

How AR System converts character sets

Important

It is possible to convert all characters from any character set that AR System supports into Unicode. However, it is not possible to convert all Unicode characters into any other single character set. Where such conversion is not possible, AR System replaces the characters in question with a substitute character.
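To make the asymmetry concrete, here is a minimal Python sketch (illustration only, not AR System code; the sample text is arbitrary). Encoding to UTF-8 always succeeds, but encoding the same text to a single legacy character set substitutes a replacement character for anything the target set cannot represent.

    # Illustration only: every supported character set maps into Unicode,
    # but not every Unicode character maps into a single legacy character set.
    text = "Résumé 日本語"                    # Latin-1 letters plus Japanese characters

    print(text.encode("utf-8"))              # always possible: UTF-8 covers all of Unicode
    print(text.encode("cp1252", errors="replace"))
    # b'R\xe9sum\xe9 ???' -- the Japanese characters have no Windows-1252 mapping,
    # so a substitute character ('?') takes their place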

Where conversion between code sets occurs in AR System depends on which versions of the AR System clients you are running and whether the AR System server is running in Unicode. The possible combinations are as follows:

  • When you run a pre-7.0.00 client against a 7.x or later AR System server running in Unicode, the AR System server converts data to the code set it would use if it were not running in Unicode.
  • When you run a 7.x or later client against an AR System server running in Unicode, the AR System server transmits data in UTF-8 and lets the API library code running in the client handle any conversion. If the client program expects UTF-8, the library need not do anything. If the client program expects some other code set, the library converts the characters.
  • When you run a 7.x or later client against a 7.x or later AR System server not running in Unicode, the API library code running in the client handles any necessary conversion. If the client does not expect Unicode, no conversion is needed.

If one side (server or client) is running in Unicode and the other is not, AR System converts between the other code set and Unicode. For compatibility with existing practice, the system uses Windows code pages, rather than the ISO-standard encodings usually used on UNIX, to represent certain languages, as outlined in the following table.

Comparison of Windows code pages and ISO-standard encodings

  Windows code page   ISO character set   Used with these languages
  1252                8859-1 (Latin 1)    English and most Western European languages written in the Latin alphabet
  1252                8859-15             Same as Latin 1 but with the Euro symbol
  1251                8859-5              Russian and other languages written in the Cyrillic alphabet
  1250                8859-2 (Latin 2)    Polish, Czech, and other Central European languages
  1257                8859-4              Baltic languages
  1254                8859-9              Turkish
  1253                8859-7              Greek
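The difference between a Windows code page and its ISO counterpart can matter in practice. As a small Python illustration (not AR System code), the Euro sign is defined in code page 1252 but not in ISO 8859-1; only 8859-15 added it:

    euro = "\u20ac"                          # EURO SIGN
    print(euro.encode("cp1252"))             # b'\x80' -- defined in Windows code page 1252
    print(euro.encode("iso-8859-15"))        # b'\xa4' -- 8859-15 added the Euro sign
    try:
        euro.encode("iso-8859-1")
    except UnicodeEncodeError:
        print("not representable in ISO 8859-1 (Latin 1)")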

If AR System determines that the client is running in Shift-JIS (universal on Japanese Windows systems) and the server is running in EUC-JP (Japanese UNIX systems), it converts characters between these encodings.

For other double-byte languages, AR System supports the following (see the conversion sketch after this list):

  • Traditional Chinese using the Big5 character encoding (AR System converts characters between Big5 and Unicode as needed.)
  • Simplified Chinese using the GB2312 character encoding (AR System converts characters between GB2312 and Unicode as needed.)
  • Korean using the EUC-KR character encoding (AR System converts characters between EUC-KR and Unicode as needed.)
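Conceptually, each of these conversions decodes the legacy bytes into Unicode and re-encodes them in the other character set. The following Python sketch (illustration only, not AR System code; the sample text is arbitrary) shows a Shift-JIS to EUC-JP round trip through Unicode:

    shift_jis_bytes = "日本語のテキスト".encode("shift_jis")

    as_unicode = shift_jis_bytes.decode("shift_jis")   # legacy bytes -> Unicode
    euc_jp_bytes = as_unicode.encode("euc_jp")         # Unicode -> the other legacy encoding

    print(shift_jis_bytes)                   # the same text as two different byte sequences
    print(euc_jp_bytes)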

The system does not support any other character-set conversions between servers and clients. To prevent errors and data loss, the character encodings between clients and servers must match.

How field widths are determined

AR System Unicode servers store characters in databases using UTF-8 (Oracle databases) or UTF-16 (Microsoft SQL Server databases). Being a byte-oriented character encoding, UTF-8 is similar to other byte-oriented encodings that AR System supports, such as Shift-JIS and EUC (for Japanese) or GB2312 (for Simplified Chinese).

However, a character sequence encoded in UTF-8 tends to be longer than the same characters in one of the other encodings. Also, characters from European languages, which occupy 1 byte each in the other encodings, can occupy 1 or 2 bytes in UTF-8, as outlined in the following table.

How characters are expanded in UTF-8

  • ASCII (expansion factor 1): Every ASCII character represents itself in UTF-8.
  • European (expansion factor 1-2): European texts combine ASCII with accented and inflected letters and special punctuation. The actual expansion depends on the text itself, and somewhat on the language. For example, Italian text is closer to 1 because it uses relatively few accented letters; Russian text is closer to 2 because nearly all Russian words are spelled with non-ASCII (2-byte) characters, leaving only spaces and punctuation as 1-byte characters.
  • Chinese, Korean (expansion factor 1-2): Chinese and Korean encodings use 2 bytes for each character; the same characters in UTF-8 occupy 3 bytes. The actual expansion is near 1.5 unless the text heavily uses ASCII (which makes the expansion slightly smaller).
  • Japanese (expansion factor 1-3): On average, the expansion is 1.5, as with Chinese and Korean: most Japanese characters occupy 2 bytes in a code set such as Shift-JIS and 3 bytes in UTF-8. EUC (the Japanese encoding historically used on UNIX systems) also has 3-byte forms; if your text contains many of these, the expansion is correspondingly smaller. Japanese is written using kanji (the full-width characters resembling Chinese text) but also uses two other writing systems, hiragana and katakana. Half-width katakana characters occupy just 1 byte in Shift-JIS and 3 bytes in UTF-8, so a character sequence with a high proportion of them expands by a factor of nearly 3. (Japanese punctuation and full-width digits occupy 2 bytes in Shift-JIS and EUC, so they do not expand as much.)

Because of this expansion, data converted to UTF-8 might no longer fit into the database columns and therefore might not be imported correctly. To avoid this problem, expand the sizes of the affected form fields.
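The expansion factors in the list above are easy to reproduce. The following Python sketch (illustration only, with arbitrary sample strings) compares the byte length of short texts in a language-specific encoding and in UTF-8:

    samples = [
        ("English (ASCII)", "character set conversion", "cp1252"),
        ("Russian (Cyrillic)", "преобразование", "cp1251"),
        ("Japanese (full-width)", "日本語のテキスト", "shift_jis"),
        ("Japanese (half-width kana)", "ｶﾀｶﾅ", "shift_jis"),
    ]

    for language, text, legacy_codec in samples:
        legacy_len = len(text.encode(legacy_codec))
        utf8_len = len(text.encode("utf-8"))
        print(f"{language}: {legacy_len} -> {utf8_len} bytes "
              f"(expansion {utf8_len / legacy_len:.2f})")

The printed factors (1.0 for ASCII, 2.0 for Cyrillic, 1.5 for full-width Japanese, and 3.0 for half-width katakana) fall within the ranges listed above.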

In UTF-16, each Unicode character occupies 1 or 2 code units (each code unit is a 16-bit quantity). Each ASCII and European character occupies 1 code unit; each Chinese, Korean, and Japanese character, which might be 2 bytes in its language-specific encoding, also occupies 1 code unit.

These expansions are valid for the characters of Unicode's Basic Multilingual Plane (BMP) — the original set of 65,536 characters presented in Unicode 1.0 and modified in Unicode 2.0. Since version 3.0, Unicode provides a mechanism to define up to about 1 million supplemental characters. Supplemental characters are defined for Chinese and also for some specialized usages in mathematics, musical typesetting, and information processing. Each supplemental character occupies 4 bytes in UTF-8, and 2 code units in UTF-16.
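A short Python illustration (not AR System code) of the code-unit counts: a BMP character occupies 1 UTF-16 code unit, while a supplemental character such as U+20BB7 occupies 2 code units (and 4 bytes in UTF-8):

    for char in ("A", "語", "\U00020BB7"):   # ASCII, BMP ideograph, supplemental ideograph
        utf16_units = len(char.encode("utf-16-le")) // 2   # each code unit is 2 bytes
        utf8_bytes = len(char.encode("utf-8"))
        print(f"U+{ord(char):04X}: {utf16_units} UTF-16 code unit(s), {utf8_bytes} UTF-8 byte(s)")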

How serialized strings are encoded and decoded

AR System servers encode and decode serialized strings using character lengths implied by the server's character encoding. The server encodes and decodes these strings using the Application-Parse-Qual and Application-Format-Qual actions that filters and active links perform. The ARDecodeARQualifierStruct and ARDecodeARAssignStruct C API calls (and the Encode variants of those functions) also process serialized strings.

This issue does not affect most serialized strings because they contain only ASCII characters. In every character encoding that AR System supports, each ASCII character occupies exactly 1 byte. The byte lengths of non-ASCII characters, however, depend on the character encoding in use. For example, a qualifier such as 'Submitter' LIKE "%à%" produces a serialized string in which the à counts as 1 byte in the standard Windows code page 1252 encoding but 2 bytes in UTF-8. Clients and servers generate errors if they are presented with strings that have incorrect lengths.
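The byte-count difference is easy to see in Python (illustration only; a real serialized qualifier contains much more than the single character shown here):

    fragment = "à"                           # the non-ASCII character from the example qualifier
    print(len(fragment.encode("cp1252")))    # 1 byte in Windows code page 1252
    print(len(fragment.encode("utf-8")))     # 2 bytes in UTF-8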

Unicode composition and normalization

AR System expects to receive characters in Unicode Normalization Form C, which uses the precomposed, single-character forms for European accented letters and for Korean Hangul syllables (rather than sequences of combining characters or jamo). Form C is the same normalization form that XML processors expect. It is also the shortest form; that is, text encoded in Form C occupies fewer bytes than the same text in other normalization forms. For more information, see the Unicode Consortium website at http://www.unicode.org.
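The following Python sketch (illustration only) shows the same accented text in decomposed form and in Form C; the composed form has fewer code points and fewer UTF-8 bytes, which is why it is preferred:

    import unicodedata

    decomposed = "Re\u0301sume\u0301"        # 'e' followed by COMBINING ACUTE ACCENT
    composed = unicodedata.normalize("NFC", decomposed)

    print(composed)                          # Résumé, using the precomposed 'é'
    print(len(decomposed), len(composed))    # 8 vs 6 code points
    print(len(decomposed.encode("utf-8")),
          len(composed.encode("utf-8")))     # 10 vs 8 bytes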

