Unicode in AR System
AR System supports localization for multiple languages by using Unicode. AR System handles character set conversions when there is a difference between the code sets used by the client application and the AR System server.
The API library in the client application manages character set conversions. The AR System server transmits data in UTF-8. If the client application is also set to receive UTF-8, no conversion is required. However, if the client application expects a different character set, the API library in the client handles the necessary conversion.
If either the server or the client supports Unicode, AR System converts between the non-Unicode code set and Unicode. For compatibility with existing practice, the system uses Windows code pages instead of the ISO-standard encodings usually used on UNIX to represent certain languages, as outlined in the following table.
Comparison of Windows code pages and ISO-standard encodings
Windows code page | ISO character set | Used with these languages |
---|---|---|
1252 | 8859-1 (Latin 1) | English and most Western European languages written in the Latin alphabet |
1252 | 8859-15 | Same as Latin 1 but with the Euro symbol |
1251 | 8859-5 | Russian and other languages written in the Cyrillic alphabet |
1250 | 8859-2 (Latin 2) | Polish, Czech, and other Central European languages |
1257 | 8859-4 | Baltic languages |
1254 | 8859-9 | Turkish |
1253 | 8859-7 | Greek |
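The difference between a Windows code page and its ISO counterpart can be seen with a single character. The following is a minimal sketch that uses Python's built-in codecs as a stand-in for the conversion that the API library performs; the helper name `encode_or_none` is hypothetical and is not part of any AR System API.

```python
# Illustrative only: Python's built-in codecs stand in for the conversion
# that the AR System API library performs internally.
EURO = "\u20ac"  # the Euro sign


def encode_or_none(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


cp1252_euro = encode_or_none(EURO, "cp1252")      # Windows code page 1252
latin9_euro = encode_or_none(EURO, "iso8859_15")  # ISO 8859-15
latin1_euro = encode_or_none(EURO, "latin_1")     # ISO 8859-1 has no Euro sign

# Convert code-page bytes to UTF-8 by decoding to Unicode and re-encoding.
utf8_euro = cp1252_euro.decode("cp1252").encode("utf-8")

print(cp1252_euro, latin9_euro, latin1_euro, utf8_euro)
```

The same character has a different byte value in each encoding, which is why a client and a server must agree on which code set is in use before any conversion takes place.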
If the client application runs in Shift-JIS (the usual encoding on Japanese Windows systems) and the AR System server runs in EUC-JP (the usual encoding on Japanese UNIX systems), AR System converts characters directly between these encodings without involving Unicode.
AR System supports the following double-byte languages, converting characters between their respective encodings and Unicode as needed:
- Traditional Chinese using the Big5 character encoding
- Simplified Chinese using the GB2312 character encoding
- Korean using the EUC-KR character encoding
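As an illustration of what converting between these encodings and Unicode involves, the following sketch round-trips sample text through each encoding in the list above. It relies on Python's standard codecs (`big5`, `gb2312`, `euc_kr`), which are assumed here to be reasonable stand-ins for the AR System conversion tables; the function names are hypothetical.

```python
# Sample text for each legacy double-byte encoding named in the list above.
SAMPLES = {
    "big5": "\u4e2d\u6587",    # Traditional Chinese sample
    "gb2312": "\u4e2d\u6587",  # Simplified Chinese sample
    "euc_kr": "\ud55c\uae00",  # Korean (Hangul) sample
}


def legacy_to_utf8(data, encoding):
    """Decode legacy bytes to Unicode, then re-encode them as UTF-8."""
    return data.decode(encoding).encode("utf-8")


def utf8_to_legacy(data, encoding):
    """Decode UTF-8 bytes to Unicode, then re-encode them in a legacy encoding."""
    return data.decode("utf-8").encode(encoding)


for encoding, text in SAMPLES.items():
    legacy = text.encode(encoding)
    utf8 = legacy_to_utf8(legacy, encoding)
    # A lossless conversion returns the original bytes on the way back.
    assert utf8_to_legacy(utf8, encoding) == legacy
    print(encoding, len(legacy), "bytes ->", len(utf8), "bytes in UTF-8")
```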
AR System does not support any other character-set conversions between the AR System server and the client application. To prevent errors and data loss, the character encodings of clients and servers must match.
Support for DIN 91379 character set
AR System is compliant with DIN 91379, the standard for characters and defined character sequences in Unicode for the electronic processing of names and data exchange in Europe. This compliance ensures that characters such as ü, ä, and ß are preserved without substitution or modification. Latin characters are represented literally, including all diacritical marks.
How field widths are determined
AR System servers store characters in databases using UTF-8 (Oracle databases) or UTF-16 (Microsoft SQL Server databases). Because UTF-8 is a byte-oriented character encoding, it is similar to other byte-oriented encodings that AR System supports, such as Shift-JIS and EUC-JP (for Japanese) or GB2312 (for Simplified Chinese).
However, a character sequence encoded in UTF-8 tends to be longer than the same characters in one of the other encodings. Also, characters from European languages, which occupy one byte each in the other encodings, can occupy one or two bytes in UTF-8, as outlined in the following table.
How characters are expanded in UTF-8
Characters | Expansion Factor for UTF-8 | Notes |
---|---|---|
ASCII | 1 | Every ASCII character represents itself in UTF-8. |
European | 1-2 | European texts combine ASCII with accented and inflected letters and special punctuation. The actual expansion depends on the text itself and partly on the language. For example, Italian text typically has an expansion factor closer to one because it uses relatively few accented letters; Russian text tends to have an expansion factor closer to two because nearly all Russian words are spelled with non-ASCII (2-byte) characters, leaving only spaces and punctuation as 1-byte characters. |
Chinese, Korean | 1-2 | Chinese and Korean encodings use two bytes for each character; the same characters in UTF-8 occupy three bytes. The expansion factor is approximately 1.5 unless the text heavily incorporates ASCII characters, resulting in a slightly smaller expansion. |
Japanese | 1-3 | On average, the expansion factor is 1.5. Most Japanese characters occupy two bytes in encodings like Shift-JIS and three bytes in UTF-8. EUC-JP, the Japanese encoding historically used on UNIX systems, also offers 3-byte forms; the expansion is correspondingly smaller if your text has many of these characters. Japanese is written using kanji (the full-width characters resembling Chinese text) together with two other writing systems, hiragana and katakana. Half-width katakana characters occupy just 1 byte in Shift-JIS and 3 bytes in UTF-8, so a character sequence with a high proportion of them expands by a factor of nearly 3. (Full-width kana, Japanese punctuation, and double-wide digits occupy 2 bytes in Shift-JIS and EUC-JP, so they do not expand as much.) |
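The expansion factors in the table can be reproduced by comparing byte lengths directly. The following is a minimal sketch, assuming Python's standard codecs approximate the legacy encodings closely enough for sizing purposes; the function name `expansion_factor` and the sample strings are illustrative choices, not part of AR System.

```python
def expansion_factor(text, legacy_encoding):
    """Ratio of the UTF-8 byte length to the byte length in a legacy encoding."""
    if not text:
        raise ValueError("text must not be empty")
    legacy_len = len(text.encode(legacy_encoding))
    utf8_len = len(text.encode("utf-8"))
    return utf8_len / legacy_len


SAMPLES = [
    ("ASCII", "Hello", "cp1252"),
    ("Italian", "città", "cp1252"),
    ("Russian", "Привет", "cp1251"),
    ("Chinese", "\u4e2d\u6587", "gb2312"),
    ("Korean", "\ud55c\uae00", "euc_kr"),
    ("Japanese hiragana", "\u3072\u3089\u304c\u306a", "shift_jis"),
    # Half-width katakana: 1 byte each in Shift-JIS, 3 bytes each in UTF-8.
    ("Japanese half-width katakana", "\uff71\uff72\uff73", "shift_jis"),
]

for label, text, encoding in SAMPLES:
    print(f"{label}: {expansion_factor(text, encoding):.2f}")
```

Running the function over a representative sample of your own data gives a better estimate than the table ranges, because the factor depends on the mix of characters in the text.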
Because of this expansion, data converted to UTF-8 might not be imported correctly: the converted values can be longer than the database columns that store them. To avoid this problem, expand the sizes of the affected fields in the forms so that the database columns can accommodate the converted data.
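A quick way to find values at risk is to compare their UTF-8 byte length with the column size. The sketch below assumes the column limit is measured in bytes, which might not hold for every database configuration; `fits_in_column` and `suggested_length` are hypothetical helpers, not AR System functions.

```python
import math


def fits_in_column(text, column_bytes, encoding="utf-8"):
    """Return True if text, encoded as given, fits in column_bytes bytes."""
    return len(text.encode(encoding)) <= column_bytes


def suggested_length(current_length, expansion_factor):
    """Scale a field length by an expansion factor, rounding up."""
    return math.ceil(current_length * expansion_factor)


value = "Привет мир"  # 10 characters: 9 Cyrillic letters and 1 space

print(fits_in_column(value, 10, "cp1251"))  # size in the legacy encoding
print(fits_in_column(value, 10))            # size after conversion to UTF-8
print(suggested_length(10, 2))              # worst case for Cyrillic text
```

Using the upper bound of the expansion range for the language is the conservative choice; a factor measured from real data can be used where column space is tight.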
In UTF-16, each Unicode character occupies 1 or 2 code units (each code unit is a 16-bit quantity). Each ASCII and European character occupies 1 code unit; each Chinese, Korean, and Japanese character, which might be 2 bytes in its language-specific encoding, also occupies 1 code unit.
These expansions are valid for the characters of Unicode's Basic Multilingual Plane (BMP), the original set of 65,536 characters presented in Unicode 1.0 and modified in Unicode 2.0. Since version 3.0, Unicode provides a mechanism to define up to about 1 million supplemental characters. Supplemental characters are defined for Chinese and also for some specialized usages in mathematics, musical typesetting, and information processing. Each supplemental character occupies 4 bytes in UTF-8, and 2 code units in UTF-16.
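The sizes described above can be checked per character. This is a small sketch using Python's `utf-8` and `utf-16-le` codecs; the helper names are illustrative, and the sample characters are arbitrary choices from each category.

```python
def utf8_bytes(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_code_units(ch):
    """Number of 16-bit code units the character occupies in UTF-16."""
    # utf-16-le is used so that no byte order mark is counted.
    return len(ch.encode("utf-16-le")) // 2


SAMPLES = {
    "ASCII letter": "A",
    "European accented letter": "\u00e9",
    "Chinese character (BMP)": "\u4e2d",
    "Chinese character (supplemental)": "\U00020000",
    "Musical symbol (supplemental)": "\U0001d11e",
}

for label, ch in SAMPLES.items():
    print(f"{label}: {utf8_bytes(ch)} UTF-8 bytes, "
          f"{utf16_code_units(ch)} UTF-16 code units")
```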
How serialized strings are encoded and decoded
AR System servers encode and decode serialized strings using character lengths implied by the server's character encoding. This encoding and decoding takes place in the Application-Parse-Qual and Application-Format-Qual actions that filters and active links perform. The ARDecodeARQualifierStruct and ARDecodeARAssignStruct C API calls (and the Encode variants of those functions) also process serialized strings.
Most serialized strings are unaffected by the character encoding in use because they contain only ASCII characters. In every character encoding that AR System supports, each ASCII character occupies exactly 1 byte. The lengths of non-ASCII characters, however, depend on the character encoding in use. For example, a qualifier such as 'Submitter' LIKE "%à%" produces a serialized string that counts the à as 1 byte in the standard Windows "Code Page 1252" encoding, but as 2 bytes in UTF-8. Clients and servers generate errors if presented with strings that have incorrect lengths.
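The length mismatch can be demonstrated with the qualifier from the example. The sketch below uses a made-up `length:payload` format purely for illustration; it is not the actual AR System serialization format, and `serialize` and `parse` are hypothetical functions.

```python
QUALIFIER = "'Submitter' LIKE \"%\u00e0%\""


def byte_length(text, encoding):
    """Length of text in bytes under the given encoding."""
    return len(text.encode(encoding))


def serialize(text, encoding):
    """Hypothetical format: ASCII byte count, a colon, then the payload."""
    payload = text.encode(encoding)
    return str(len(payload)).encode("ascii") + b":" + payload


def parse(data):
    """Return the payload, or raise ValueError if the declared length is wrong."""
    prefix, payload = data.split(b":", 1)
    if int(prefix) != len(payload):
        raise ValueError("declared length does not match payload length")
    return payload


print(byte_length(QUALIFIER, "cp1252"), byte_length(QUALIFIER, "utf-8"))

# A length computed under Code Page 1252 no longer matches once the payload
# itself has been converted to UTF-8.
stale = (str(byte_length(QUALIFIER, "cp1252")).encode("ascii")
         + b":" + QUALIFIER.encode("utf-8"))
try:
    parse(stale)
except ValueError as err:
    print("rejected:", err)
```

The point of the sketch is only that any embedded length must be computed in the same encoding as the bytes it describes.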
Unicode composition and normalization
AR System expects to receive characters in Unicode Normalization Form C, which uses the single-character forms for European accented letters and for Korean jamo characters. Form C is the same normalization form as expected by XML processors. This form is also the shortest; that is, text encoded in this form occupies fewer bytes than in other normalization forms. For more information, see the Unicode Consortium website at http://www.unicode.org.
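Text from an arbitrary source can be brought into Normalization Form C before it is sent to the server. The following is a minimal sketch using Python's standard `unicodedata` module; the wrapper names `to_nfc` and `is_nfc` are illustrative.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


def is_nfc(text):
    """Return True if text is already in Normalization Form C."""
    return to_nfc(text) == text


decomposed_e = "e\u0301"               # 'e' followed by a combining acute accent
decomposed_han = "\u1112\u1161\u11ab"  # three Korean jamo forming one syllable

for sample in (decomposed_e, decomposed_han):
    composed = to_nfc(sample)
    print(len(sample.encode("utf-8")), "bytes ->",
          len(composed.encode("utf-8")), "bytes after NFC")
```

Both samples shrink after normalization, which matches the statement that Form C is the shortest of the normalization forms for this kind of text.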