unicode

MKS Toolkit Unicode support 

Miscellaneous Information


DESCRIPTION

Besides normal ASCII text files, MKS Toolkit utilities also support UTF-8 files and 16-bit wide Unicode files (that is, files using UTF-8 characters and 16-bit wide Unicode characters, respectively). You can also include such characters on MKS Toolkit command lines and in path names. However, MKS Toolkit utilities cannot handle non-OEM characters in file names unless the locale supports double-byte characters (such as the Japanese locale). Consequently, even though the utilities support UTF-8 and Unicode characters in files on all platforms, for maximum portability across all Windows platforms, all file names used in scripts for utilities like awk, sh, csh, and others should contain only ASCII characters from the OEM code page.

Normally, when a file is read by an MKS Toolkit utility, the utility determines its format and the type of characters it contains and will use that same format for any output it produces. The key to determining the file format is the multiple-byte marker usually found at the beginning of UTF-8 and Unicode files. This marker indicates whether the file's contents are Unicode big-endian, Unicode little-endian, or UTF-8. However, when the multiple-byte marker is not present, you can set the TK_STDIO_DEFAULT_INPUT_FORMAT and TK_STDIO_DEFAULT_OUTPUT_FORMAT environment variables (see ENVIRONMENT VARIABLES) to force any input or output to be treated as Unicode or UTF-8. Additionally, some utilities (such as cat, more, or tail) feature a -U option that lets you specify how to handle input and output on the utility's command line.
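The marker check described above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's own detection logic: it assumes that the multiple-byte marker is the standard Unicode byte-order mark, and the function name detect_format is hypothetical.

```python
import codecs

# Standard byte-order marks, paired with the format names used in this document.
# Assumption: the "multiple-byte marker" is the usual Unicode byte-order mark.
_MARKERS = [
    (codecs.BOM_UTF8, "UTF-8"),                      # EF BB BF
    (codecs.BOM_UTF16_LE, "UNICODE_LITTLE_ENDIAN"),  # FF FE
    (codecs.BOM_UTF16_BE, "UNICODE_BIG_ENDIAN"),     # FE FF
]


def detect_format(data: bytes):
    """Return the format named by a leading marker, or None if there is none."""
    for marker, name in _MARKERS:
        if data.startswith(marker):
            return name
    return None
```

A result of None corresponds to the case where no marker is present, which is where the default format environment variables or the -U option come into play.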

When multiple input files are specified for a utility, the format of the output generated is normally the same as the first input file specified. There are, however, two exceptions to this.

When the utility first reads all input files, processes them, and then generates output (for example, diff), the output is normally in the format of the first specified input file unless the input files are a mix of ASCII and Unicode/UTF-8 formats. In that case, the output format is the format of the first non-ASCII input file specified.

When multiple input files are being read and multiple output files are being generated (as can often be the case with awk and perl scripts), the format of a given output file (or standard output) depends upon which input files have been read at the time of the output file's creation. If only ASCII format input files have been read at that time, the output file is created using the format of the first ASCII file read. If, however, only non-ASCII (Unicode or UTF-8) files or a mix of ASCII and non-ASCII files have been read, the output file is created using the format of the first non-ASCII input file read.
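Both exceptions reduce to the same selection rule: prefer the first non-ASCII input format seen so far, and otherwise fall back to the first input format. The following Python sketch states that rule under the assumption that each input file's format is already known; the function name output_format and the list argument are hypothetical and are not part of MKS Toolkit.

```python
# Canonical names of the non-ASCII formats listed in this document.
NON_ASCII = {"UTF-8", "UNICODE_LITTLE_ENDIAN", "UNICODE_BIG_ENDIAN"}


def output_format(formats_read):
    """Choose an output format from the formats of the input files read so far.

    formats_read lists the input formats in the order the files were read.
    """
    if not formats_read:
        return None  # nothing read yet, so nothing to inherit
    for fmt in formats_read:
        if fmt in NON_ASCII:
            return fmt  # first non-ASCII input wins
    return formats_read[0]  # only ASCII inputs: use the first one
```

For a utility such as diff, the list would hold every input file; for an awk or perl script, it would hold only the files read before the output file was created.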

When deciding how files should be treated, the input and output code pages are also taken into account. MKS Toolkit supports the system OEM and ANSI code pages and normally sets the output code page to match the input code page.

For the TK_STDIO_DEFAULT_INPUT_FORMAT and TK_STDIO_DEFAULT_OUTPUT_FORMAT environment variables, the true meaning of the value ASCII depends upon the appropriate input or output code page. If the code page is the system ANSI code page, ASCII is equivalent to ASCII_ANSI; otherwise, it is equivalent to ASCII_OEM.

For ASCII input, when TK_STDIO_DEFAULT_INPUT_FORMAT is not set and the input code page is the system ANSI code page, the input is assumed to be in ASCII_ANSI format. Otherwise, it is assumed to be in ASCII_OEM format.

For ASCII output, when TK_STDIO_DEFAULT_OUTPUT_FORMAT is not set and the output code page is the system ANSI code page, the output is written in ASCII_ANSI format. Otherwise, it is written in ASCII_OEM format.

Note:

To change the input and output code page, use the stty utility's cp command (for example, stty cp 437). To display a list of available code pages on your system or display the current code pages, use the sysinf utility's codepages command (for example, sysinf codepages -c).

By default, Windows systems use the OEM code page for all console input and output. Most MKS Toolkit non-graphical utilities use console input/output as does viw (for compatibility with vi). As a result, files that have been saved in ANSI format (for example, by Notepad) may not be correctly read or displayed by these utilities. To read and display these files correctly, you can do one of the following:

Finally, the font selected for a console window may also affect how characters are displayed. For example, the Lucida Console font allows a greater number of characters to be displayed. When a character cannot be displayed in the console window using the current font, the system default character is displayed in its place.

File Character Formats

MKS Toolkit supports text files with characters stored in a variety of formats. You can specify the precise input and output format of the files that MKS Toolkit utilities handle by using environment variables (see ENVIRONMENT VARIABLES), the -U option found in several utilities (such as diff, more, and wc), and various options and commands in vi. A variety of values can be used to express file character formats. While all these values can be used with the environment variables and vi, only the single-character values can be used with -U. All values are case insensitive.

The following values indicate ASCII files:

ASCII_ANSI               ASCII characters from the ANSI code page
ASCII_OEM                ASCII characters from the OEM code page
ASCII                    same as ASCII_OEM or ASCII_ANSI depending
                         on input code page
ANSI                     same as ASCII_ANSI
OEM                      same as ASCII_OEM
A                        same as ASCII_ANSI
O                        same as ASCII_OEM

The following values indicate Unicode/UTF-8 (non-ASCII) files:

UNICODE_BIG_ENDIAN       Big endian 16-bit wide characters
UNICODE_LITTLE_ENDIAN    Little endian 16-bit wide characters
UTF-8                    UTF-8 characters
UNICODE                  same as UNICODE_LITTLE_ENDIAN
L                        same as UNICODE_LITTLE_ENDIAN
B                        same as UNICODE_BIG_ENDIAN
UTF8                     same as UTF-8
8                        same as UTF-8
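As an illustration of how the two tables above fit together, the following Python sketch maps every listed value to its canonical name. It is a reading of the tables, not toolkit code: the names normalize_format, valid_for_U_option, and code_page_is_ansi are hypothetical, and returning None for an unlisted value is an assumption.

```python
# Every value from the two tables above, mapped to its canonical name.
_ALIASES = {
    "ASCII_ANSI": "ASCII_ANSI", "ANSI": "ASCII_ANSI", "A": "ASCII_ANSI",
    "ASCII_OEM": "ASCII_OEM", "OEM": "ASCII_OEM", "O": "ASCII_OEM",
    "UNICODE_LITTLE_ENDIAN": "UNICODE_LITTLE_ENDIAN",
    "UNICODE": "UNICODE_LITTLE_ENDIAN", "L": "UNICODE_LITTLE_ENDIAN",
    "UNICODE_BIG_ENDIAN": "UNICODE_BIG_ENDIAN", "B": "UNICODE_BIG_ENDIAN",
    "UTF-8": "UTF-8", "UTF8": "UTF-8", "8": "UTF-8",
}


def normalize_format(value, code_page_is_ansi=False):
    """Return the canonical format name, or None if the value is not listed."""
    key = value.upper()  # values are case insensitive
    if key == "ASCII":
        # ASCII depends on the code page: ANSI code page -> ASCII_ANSI, else ASCII_OEM.
        return "ASCII_ANSI" if code_page_is_ansi else "ASCII_OEM"
    return _ALIASES.get(key)


def valid_for_U_option(value):
    """Only the single-character values (A, O, L, B, 8) can be used with -U."""
    return len(value) == 1 and value.upper() in _ALIASES
```

With these definitions, normalize_format("utf8") and normalize_format("8") both yield UTF-8, while valid_for_U_option accepts only the second spelling.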

ENVIRONMENT VARIABLES

TK_ARCHIVE_CHARSET 

Contains the format to be used by cpio, tar, pax, vpax, zip, or unzip when reading and writing file names to an archive. The value must be one of ASCII_ANSI, ASCII_OEM, or UTF-8 (or their equivalents) as described in the File Character Formats section above.

When this variable is unset or it is set to a value other than those listed earlier, the default OEM character set is used.
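The fallback rule for TK_ARCHIVE_CHARSET can be written as a small lookup. The Python sketch below is an assumption-laden illustration: the function name archive_charset is hypothetical, the environment is passed in as a plain dictionary, and the plain value ASCII is left out of the accepted set because the text above does not say whether it counts as an equivalent.

```python
# Accepted values and their equivalents, as listed under File Character Formats.
# Assumption: the plain value "ASCII" is not accepted here.
_ARCHIVE_VALUES = {
    "ASCII_ANSI": "ASCII_ANSI", "ANSI": "ASCII_ANSI", "A": "ASCII_ANSI",
    "ASCII_OEM": "ASCII_OEM", "OEM": "ASCII_OEM", "O": "ASCII_OEM",
    "UTF-8": "UTF-8", "UTF8": "UTF-8", "8": "UTF-8",
}


def archive_charset(env):
    """Return the charset used for archive file names, given a dict of variables."""
    value = env.get("TK_ARCHIVE_CHARSET")
    if value is None:
        return "ASCII_OEM"  # unset: default OEM character set
    # Any value outside the accepted set also falls back to the OEM default.
    return _ARCHIVE_VALUES.get(value.upper(), "ASCII_OEM")
```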

TK_CMDSUB_FORMAT 

Contains the format to be used for the output from command substitution in MKS KornShell (the `command_line` and $(command_line) structures) and MKS C Shell (the `command_line` structure). The value must be one of those listed in the File Character Formats section above.

When TK_CMDSUB_FORMAT is not set, the value of the TK_STDIO_DEFAULT_INPUT_FORMAT environment variable is used as the default format.

When TK_CMDSUB_FORMAT and TK_STDIO_DEFAULT_INPUT_FORMAT are both unset, either ASCII_OEM or ASCII_ANSI is used as the default format, as dictated by the current code page. This provides compatibility with older versions of MKS Toolkit.
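The lookup order for command substitution can be summarized in Python as follows. This is a minimal sketch of the order described above, not the shells' actual code; the function name cmdsub_format and the code_page_is_ansi flag are hypothetical, and the environment is passed in as a dictionary.

```python
def cmdsub_format(env, code_page_is_ansi):
    """Pick the format for command substitution output.

    env is a mapping of environment variables; code_page_is_ansi says whether
    the current code page is the system ANSI code page.
    """
    # TK_CMDSUB_FORMAT is consulted first, then TK_STDIO_DEFAULT_INPUT_FORMAT.
    for name in ("TK_CMDSUB_FORMAT", "TK_STDIO_DEFAULT_INPUT_FORMAT"):
        if name in env:
            return env[name]
    # Both unset: the current code page dictates the default.
    return "ASCII_ANSI" if code_page_is_ansi else "ASCII_OEM"
```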

TK_HEREDOC_FORMAT 

Contains the format to be used for here documents in MKS KornShell and MKS C Shell. The value must be one of those listed in the File Character Formats section above.

When TK_HEREDOC_FORMAT and TK_STDIO_DEFAULT_OUTPUT_FORMAT are both unset, here documents are assumed to use ASCII_OEM characters. This provides compatibility with older versions of MKS Toolkit.

When TK_STDIO_DEFAULT_OUTPUT_FORMAT is set to a Unicode/UTF-8 format and you are feeding a here document to a non-MKS Toolkit utility that won't understand its format, you should set TK_HEREDOC_FORMAT to ASCII_OEM and export it.

When TK_HEREDOC_FORMAT is unset or TK_STDIO_DEFAULT_OUTPUT_FORMAT is set to an ASCII format and your here document contains non-ASCII (OEM) characters, you should set TK_HEREDOC_FORMAT to UTF-8 and export it.

This variable takes precedence over TK_STDIO_DEFAULT_OUTPUT_FORMAT.
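One way to read the precedence rule for here documents is sketched below in Python. The source states only that TK_HEREDOC_FORMAT takes precedence and that ASCII_OEM applies when both variables are unset; using TK_STDIO_DEFAULT_OUTPUT_FORMAT when only TK_HEREDOC_FORMAT is unset is an inference, and the function name heredoc_format is hypothetical.

```python
def heredoc_format(env):
    """Pick the here document format from a mapping of environment variables."""
    # TK_HEREDOC_FORMAT wins over TK_STDIO_DEFAULT_OUTPUT_FORMAT when both are set.
    # Assumption: the output default applies when only TK_HEREDOC_FORMAT is unset.
    for name in ("TK_HEREDOC_FORMAT", "TK_STDIO_DEFAULT_OUTPUT_FORMAT"):
        if name in env:
            return env[name]
    # Both unset: ASCII_OEM, for compatibility with older versions.
    return "ASCII_OEM"
```

Under this reading, a shell with TK_STDIO_DEFAULT_OUTPUT_FORMAT set to UTF-8 and TK_HEREDOC_FORMAT set to ASCII_OEM produces ASCII_OEM here documents, which matches the advice given above for feeding non-MKS Toolkit utilities.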

TK_STDIO_DEFAULT_INPUT_FORMAT 

Sets the default input format for files that don't have the initial multibyte marker. The value must be one of those listed in the File Character Formats section above.

TK_STDIO_DEFAULT_OUTPUT_FORMAT 

Sets the default output format. Normally the format of the first file read is used as the default output format. The value must be one of those listed in the File Character Formats section above.


AVAILABILITY

PTC MKS Toolkit for Power Users
PTC MKS Toolkit for System Administrators
PTC MKS Toolkit for Developers
PTC MKS Toolkit for Interoperability
PTC MKS Toolkit for Professional Developers
PTC MKS Toolkit for Professional Developers 64-Bit Edition
PTC MKS Toolkit for Enterprise Developers
PTC Windchill Requirements and Validation


SEE ALSO

Commands:
cat, diff, grep, head, more, od, registry, shexec, strings, tail, unzip, vi, wc

File Formats:
magic


PTC MKS Toolkit 10.5 Documentation Build 40.