Editing Unicode (UTF-8) data

Warning

A record layout is required to recognize data as Unicode data. Therefore, we recommend editing Unicode data only in Formatted (FMT) or Vertical Formatted (VFMT) display mode.

Unicode (UTF-8) data is converted and displayed as an EBCDIC text image when the data is mapped by a COBOL USAGE UTF8 or PL/I UCHAR field in a formatted display.

The following table shows how the Unicode data is displayed in each mode.

Display of Unicode (UTF-8) Data in Browse and Edit

	Data
	Ex. PIC U(2) or PIC U BYTE-LENGTH 8
Mode	Valid Data	Invalid Data
	Sample: X'41E0A8A042E0A8A1'	Sample: X'E0A8A0E0A8A1E0A8'
FMT	<Browse>¹ A.B. <Edit> A B	X'E0A8A0E0A8A1E0A8'
VFMT - HEX OFF	<Browse> A.B. <Edit> A B	U-INVALI
VFMT - HEX ON	A. B.² 4EAA4EAA 10802081	U-INVALI³ EAAEAAEA 08008108

¹ : The non-displayable substitution character is converted to a period in Browse and also in vertically formatted (VFMT) Edit with HEX ON. In Edit, if there is a substitution character, it is replaced by a blank, but the entire value is protected.
² : The character line (first line) is protected in HEX ON display in Browse and Edit.
³ : The substitution character is always displayed as a period in Browse and in VFMT with HEX ON.

If Unicode conversion fails for any reason, File-AID displays the original Unicode data in hexadecimal format in FMT mode and displays U-INVALID in VFMT mode.
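
The valid and invalid samples in the table above can be checked outside File-AID. The following Python sketch is only an illustration, not File-AID's conversion routine: the function name ebcdic_text_image is hypothetical, and code page 037 is an assumed stand-in for the active EBCDIC code page. It decodes the bytes as UTF-8 and shows a period for any character that the EBCDIC code page cannot represent, which is how Browse presents the substitution character.

```python
from typing import Optional


def ebcdic_text_image(raw: bytes, codec: str = "cp037") -> Optional[str]:
    """Return a Browse-style text image, or None when raw is not valid UTF-8.

    Assumption: cp037 stands in for the active EBCDIC code page.
    Unprintable control characters are not handled in this sketch.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # conversion failed; the data would be treated as invalid

    image = []
    for ch in text:
        try:
            ch.encode(codec)      # representable in the EBCDIC code page
            image.append(ch)
        except UnicodeEncodeError:
            image.append(".")     # shown as a period, as in Browse
    return "".join(image)


valid = bytes.fromhex("41E0A8A042E0A8A1")
invalid = bytes.fromhex("E0A8A0E0A8A1E0A8")

print(ebcdic_text_image(valid))    # A.B.
print(ebcdic_text_image(invalid))  # None
```

The invalid sample fails because its last two bytes (X'E0A8') begin a three-byte UTF-8 sequence that is never completed.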

Formatted Displays

The following changes apply to the two formatted display modes, formatted (FMT) and vertically formatted (VFMT):

SHOW PICTURE

For a UTF-8 field, the Picture column shows UT8B(nn) or UT8C(nn), reflecting how the data is declared. For example, PICTURE U(5) is displayed as a PICTURE of UT8C(5), and PICTURE U BYTE-LENGTH 5 is displayed as a PICTURE of UT8B(5).

SHOW PICTURE for Unicode (UTF-8) Data

SHOW FORMAT

The format for a UTF-8 field is nn/UT8B or nn/UT8C, where nn is the length of the field. For example, PICTURE U BYTE-LENGTH 5 displays as a FORMAT of 5/UT8B.

SHOW FORMAT for Unicode (UTF-8) Data

INIT command

When you issue the INIT command (see INIT), File-AID/MVS initializes UTF-8 fields as follows:

A UTF-8 field (for example, PIC U with or without BYTE-LENGTH) is initialized with Unicode (UTF-8) blank characters (X'20').

FIND and CHANGE commands

When you issue the FIND (see FIND-F) or CHANGE (see CHANGE-CHG-C) command for Unicode (UTF-8) data, these restrictions apply:

Only hexadecimal format is supported.
The FIND parameters VALID and INVALID are not supported.
In FMT mode and VFMT HEX OFF mode, the cursor does not point to the exact position of the found string.

Example:

To find the number 611 in a Unicode (UTF-8) field, enter this FIND command:

F X'363131'
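
Because only hexadecimal operands are supported, it can help to derive the hex string from the text you want to find. The following Python sketch shows one way to do that; the helper name find_hex_operand is hypothetical and is not part of File-AID.

```python
def find_hex_operand(text: str) -> str:
    """Return text as an X'..' hex literal of its UTF-8 bytes."""
    return "X'" + text.encode("utf-8").hex().upper() + "'"


# Build the FIND command for the number 611.
print("F " + find_hex_operand("611"))  # F X'363131'
```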

SORT Order

The collating sequence of Unicode (UTF-8) differs from that of EBCDIC. The SORT command allows you to reorder the data. The SORT command always operates on the underlying data; thus, when the data is Unicode, the results may differ from those for EBCDIC data.

The following table shows the difference between Unicode (UTF-8) order and EBCDIC order.

SORT Order for EBCDIC and Unicode UTF-8

EBCDIC		Unicode UTF-8
Order	HEX Value	Order	HEX Value
Space	X'40'	Space	X'20'
Numbers (0 to 9)	X'F0' to X'F9'	Numbers (0 to 9)	X'30' to X'39'
Lowercase letters (a to z)	X'81' to X'89', X'91' to X'99', X'A2' to X'A9'	Uppercase letters (A to Z)	X'41' to X'5A'
Uppercase letters (A to Z)	X'C1' to X'C9', X'D1' to X'D9', X'E2' to X'E9'	Lowercase letters (a to z)	X'61' to X'7A'
		Double-byte uppercase letters (A to Z)	X'C1C1' to X'C1DA'
		Double-byte lowercase letters (a to z)	X'C1E1' to X'C1FA'
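
The difference between the two collating sequences for single-byte characters can be reproduced with a short Python sketch. It sorts the same four values by their encoded bytes, once as UTF-8 and once as EBCDIC. Code page 037 is an assumption used for illustration; the letters, digits, and space shown here have the hex values listed in the table above.

```python
values = ["a", "A", "1", " "]

# Sort by the underlying bytes, as a byte-wise sort of the stored data would.
utf8_order = sorted(values, key=lambda s: s.encode("utf-8"))
ebcdic_order = sorted(values, key=lambda s: s.encode("cp037"))  # assumed code page

print(utf8_order)    # [' ', '1', 'A', 'a']
print(ebcdic_order)  # [' ', 'a', 'A', '1']
```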

Character Display Line for Unicode UTF-8

File-AID/MVS recognizes Unicode data fields and displays the correct character representation of the Unicode data, based upon the active code page. For each Unicode field, the Unicode data is converted to the appropriate CCSID.

Once the data has been converted, normal File-AID/MVS processing determines whether the data is valid. When the data is valid, the character defined in the active code page is displayed. Any unprintable character is displayed as a period (.) in Browse. In Edit, unprintable characters are displayed as blanks and the entire field is protected. If a field contains any multi-byte UTF-8 character, the entire field is protected.

When any of the characters are invalid, the field is displayed in HEX in FMT mode, and U-INVALID is displayed in VFMT mode.

In VFMT mode, values in the character display line are the converted EBCDIC-based data from the Unicode hexadecimal values. In hexadecimal format, you cannot overtype the values in the character display line. Each character is positioned to align with its corresponding Unicode hexadecimal value, because the Unicode data length may differ from the converted EBCDIC-based data length. For example, the double-byte UTF-8 character data ABC is 3 bytes in EBCDIC; in Unicode UTF-8 the same data is 6 bytes (C1C1C1C2C1C3). In vertical format, the hex value is displayed vertically:

A B C
CCCCCC
111213
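
The two hex lines in the display above are the first and second hexadecimal digits of each byte, written one under the other. The following Python sketch reproduces them for the same six bytes. The function name vertical_hex is hypothetical, and the sketch does not attempt to build the character line.

```python
def vertical_hex(raw: bytes):
    """Split bytes into the two lines of a vertical hex display.

    The first line holds the high-order hex digit of each byte,
    the second line holds the low-order hex digit.
    """
    hex_text = raw.hex().upper()
    first_digits = hex_text[0::2]
    second_digits = hex_text[1::2]
    return first_digits, second_digits


top, bottom = vertical_hex(bytes.fromhex("C1C1C1C2C1C3"))
print(top)     # CCCCCC
print(bottom)  # 111213
```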

The following figure shows Unicode data displayed in Vertical Formatted display in hexadecimal format.

Unicode (UTF-8) Data in Vertical Formatted display with HEX ON

File-AID/MVS Data Validation

File-AID/MVS uses internal character set tables to determine if data contains unprintable characters. Several tables for different languages are shipped with the product. For online validation, the Character Set to be used is specified under the Parameters option for System Parameters. For batch validation, the CHARSET parameter is used to identify the Character Set. See the Install section for more information.

Printing Unicode Data

Use the CCSID parameter when printing Unicode data to specify the code page to be used. Manually add the parameter to the print JCL to override the default CCSID, for example:

$$DD01 VPRINT SHOW=F,LAYOUT=NATL,OUT=0,TRUNC=NO,
FILLER=ON,ZERO=OFF,CCSID=1140