SYNOPSIS
#include <tcl.h>
Tcl_Encoding Tcl_GetEncoding(interp, name)
void Tcl_FreeEncoding(encoding)
char * Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)
int Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
char * Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)
int Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr, dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
char * Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)
TCHAR * Tcl_WinUtfToTChar(src, srcLen, dstPtr)
char * Tcl_GetEncodingName(encoding)
int Tcl_SetSystemEncoding(interp, name)
void Tcl_GetEncodingNames(interp)
Tcl_Encoding Tcl_CreateEncoding(typePtr)
char * Tcl_GetDefaultEncodingDir(void)
void Tcl_SetDefaultEncodingDir(path)
ARGUMENTS
- Tcl_Interp *interp (in)
-
Interpreter to use for error reporting, or NULL if no error reporting is desired.
- CONST char *name (in)
-
Name of encoding to load.
- Tcl_Encoding encoding (in)
-
The encoding to query, free, or use for converting text. If encoding is NULL, the current system encoding is used.
- CONST char *src (in)
-
For the Tcl_ExternalToUtf() functions, an array of bytes in the specified encoding that are to be converted to UTF-8. For the Tcl_UtfToExternal() and Tcl_WinUtfToTChar() functions, an array of UTF-8 characters to be converted to the specified encoding.
- CONST TCHAR *tsrc (in)
-
An array of Windows TCHAR characters to convert to UTF-8.
- int srcLen (in)
-
Length of src or tsrc in bytes. If the length is negative, the encoding-specific length of the string is used.
- Tcl_DString *dstPtr (out)
-
Pointer to an uninitialized or free Tcl_DString in which the converted result will be stored.
- int flags (in)
-
Various flag bits OR-ed together. TCL_ENCODING_START signifies that the source buffer is the first block in a (potentially multi-block) input stream, telling the conversion routine to reset to an initial state and perform any initialization that needs to occur before the first byte is converted. TCL_ENCODING_END signifies that the source buffer is the last block in a (potentially multi-block) input stream, telling the conversion routine to perform any finalization that needs to occur after the last byte is converted and then to reset to an initial state. TCL_ENCODING_STOPONERROR signifies that the conversion routine should return immediately upon reading a source character that doesn't exist in the target encoding; otherwise a default fallback character will automatically be substituted.
- Tcl_EncodingState *statePtr (in/out)
-
Used when converting a (generally long or indefinite length) byte stream in a piece by piece fashion. The conversion routine stores its current state in *statePtr after src (the buffer containing the current piece) has been converted; that state information must be passed back when converting the next piece of the stream so the conversion routine knows what state it was in when it left off at the end of the last piece. May be NULL, in which case the value specified for flags is ignored and the source buffer is assumed to contain the complete string to convert.
- char *dst (out)
-
Buffer in which the converted result will be stored. No more than dstLen bytes will be stored in dst.
- int dstLen (in)
-
The maximum length of the output buffer dst in bytes.
- int *srcReadPtr (out)
-
Filled with the number of bytes from src that were actually converted. This may be less than the original source length if there was a problem converting some source characters. May be NULL.
- int *dstWrotePtr (out)
-
Filled with the number of bytes that were actually stored in the output buffer as a result of the conversion. May be NULL.
- int *dstCharsPtr (out)
-
Filled with the number of characters that correspond to the number of bytes stored in the output buffer. May be NULL.
- Tcl_EncodingType *typePtr (in)
-
Structure that defines a new type of encoding.
- char *path (in)
-
A path to the location of the encoding file.
INTRODUCTION
These routines convert between Tcl's internal character representation, UTF-8, and character representations used by various operating systems or file systems, such as Unicode, ASCII, or Shift-JIS. When operating on strings, such as obtaining the names of files or displaying characters using international fonts, the strings must be translated into one or possibly multiple formats that the various system calls can use. For instance, on a Japanese UNIX workstation, a user might obtain a filename represented in the EUC-JP file encoding and then translate the characters to the jisx0208 font encoding in order to display the filename in a Tk widget. The purpose of the encoding package is to help bridge the translation gap. UTF-8 provides an intermediate staging ground for all the various encodings. In the example above, text would be translated into UTF-8 from whatever file encoding the operating system is using. Then it would be translated from UTF-8 into whatever font encoding the display routines require.
Some basic encodings are compiled into Tcl. Others can be defined by the user or dynamically loaded from encoding files in a platform-independent manner.
DESCRIPTION
The encoding package maintains a database of all encodings currently in use.
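The first time name is seen, Tcl_GetEncoding() returns an encoding with a reference count of 1. Later requests for the same name increment that reference count rather than allocating a new encoding.
When an encoding is no longer needed, Tcl_FreeEncoding() should be called to release it.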
- TCL_OK
-
All bytes of src were converted.
- TCL_CONVERT_NOSPACE
-
The destination buffer was not large enough for all of the converted data; as many characters as could fit were converted though.
- TCL_CONVERT_MULTIBYTE
-
The last few bytes in the source buffer were the beginning of a multibyte sequence, but more bytes were needed to complete this sequence. A subsequent call to the conversion routine should pass a buffer containing the unconverted bytes that remained in src plus some further bytes from the source stream to properly convert the formerly split-up multibyte sequence.
- TCL_CONVERT_SYNTAX
-
The source buffer contained an invalid character sequence. This may occur if the input stream has been damaged or if the input encoding method was misidentified.
- TCL_CONVERT_UNKNOWN
-
The source buffer contained a character that could not be represented in the target encoding and TCL_ENCODING_STOPONERROR was specified.
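The TCL_CONVERT_MULTIBYTE case is easiest to follow in a loop that carries leftover bytes into the next call. The following is a minimal, self-contained sketch that does not call Tcl at all. The helpers split_count_chars() and split_count_stream() are hypothetical, the status names only stand in for TCL_OK and TCL_CONVERT_MULTIBYTE, and the rule that every byte of 0x80 or above is a lead byte followed by exactly one trail byte is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for TCL_OK and TCL_CONVERT_MULTIBYTE; local to this sketch. */
enum { SPLIT_OK = 0, SPLIT_MULTIBYTE = 1 };

/*
 * Toy "conversion": count the complete characters in src.
 * Assumed rule: a byte >= 0x80 is a lead byte that needs one trail byte.
 * *srcRead receives the number of bytes consumed, like srcReadPtr.
 */
static int split_count_chars(const unsigned char *src, size_t srcLen,
                             size_t *srcRead, size_t *numChars)
{
    size_t i = 0, n = 0;
    int result = SPLIT_OK;

    while (i < srcLen) {
        size_t need = (src[i] >= 0x80) ? 2 : 1;
        if (srcLen - i < need) {
            result = SPLIT_MULTIBYTE;   /* sequence is cut off at the end */
            break;
        }
        i += need;
        n++;
    }
    *srcRead = i;
    *numChars = n;
    return result;
}

/*
 * Feed a stream in pieces of at most `chunk` bytes. Bytes that a call
 * left unconverted are moved to the front of the buffer and the next
 * piece is appended after them.
 */
static size_t split_count_stream(const unsigned char *stream, size_t len,
                                 size_t chunk)
{
    unsigned char buf[64];
    size_t held = 0, total = 0, pos = 0;

    assert(chunk > 0 && chunk < sizeof buf);  /* room for one held byte */
    while (pos < len) {
        size_t take = (len - pos < chunk) ? len - pos : chunk;
        size_t avail, used, n;

        memcpy(buf + held, stream + pos, take);
        pos += take;
        avail = held + take;
        (void) split_count_chars(buf, avail, &used, &n);
        total += n;
        held = avail - used;            /* unconverted tail, if any */
        memmove(buf, buf + used, held);
    }
    return total;
}
```

With a real encoding the same pattern applies: the unconverted tail is whatever lies beyond the count reported through srcReadPtr.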
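If you planned to use the same "char" based interfaces on both Windows 95 and Windows NT, you could use Tcl_UtfToExternal() (or Tcl_UtfToExternalDString()) with an encoding of NULL (the current system encoding). If you instead planned to use the Unicode interface on NT and the "char" interface on 95, you would have to repeat the following kind of test throughout your program (shown as pseudo-code):
    if (running NT) {
        encoding <- Tcl_GetEncoding("unicode");
        nativeBuffer <- Tcl_UtfToExternal(encoding, utfBuffer);
        Tcl_FreeEncoding(encoding);
    } else {
        nativeBuffer <- Tcl_UtfToExternal(NULL, utfBuffer);
    }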
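The typePtr argument to Tcl_CreateEncoding() contains the name of the encoding and the procedures that will be called to convert between this encoding and UTF-8. It is defined as follows:
    typedef struct Tcl_EncodingType {
        CONST char *encodingName;
        Tcl_EncodingConvertProc *toUtfProc;
        Tcl_EncodingConvertProc *fromUtfProc;
        Tcl_EncodingFreeProc *freeProc;
        ClientData clientData;
        int nullSize;
    } Tcl_EncodingType;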
The encodingName provides a string name for the encoding, by which it can be referred to in other procedures such as Tcl_GetEncoding().
The callback procedures toUtfProc and fromUtfProc should match the type Tcl_EncodingConvertProc:
    typedef int Tcl_EncodingConvertProc(
        ClientData clientData,
        CONST char *src,
        int srcLen,
        int flags,
        Tcl_EncodingState *statePtr,
        char *dst,
        int dstLen,
        int *srcReadPtr,
        int *dstWrotePtr,
        int *dstCharsPtr);
The toUtfProc and fromUtfProc procedures are called by the Tcl_ExternalToUtf() or Tcl_UtfToExternal() family of functions to perform the actual conversion.
The callback procedure freeProc, if non-NULL, should match the type Tcl_EncodingFreeProc:
typedef void Tcl_EncodingFreeProc( ClientData clientData);
This freeProc function is called when the encoding is deleted.
The clientData parameter is the same as the clientData field specified to Tcl_CreateEncoding() when the encoding was created.
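To make the callback contract concrete, here is a hedged sketch of what a toUtfProc for a stateless single-byte encoding (ISO 8859-1) might look like. It is written to compile without tcl.h, so everything Tcl-specific is an assumption of the sketch: ToyClientData plays the role of ClientData, statePtr is declared as void * rather than Tcl_EncodingState *, and L1_OK and L1_NOSPACE are local stand-ins for TCL_OK and TCL_CONVERT_NOSPACE. The function name latin1_to_utf8() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for TCL_OK and TCL_CONVERT_NOSPACE. */
enum { L1_OK = 0, L1_NOSPACE = 1 };

typedef void *ToyClientData;    /* plays the role of ClientData */

/*
 * Convert ISO 8859-1 bytes to UTF-8, following the argument order of
 * Tcl_EncodingConvertProc. Bytes below 0x80 are copied; the rest become
 * a two-byte UTF-8 sequence. Conversion stops early when the next
 * character would not fit in dst.
 */
static int latin1_to_utf8(ToyClientData clientData,
        const char *src, int srcLen, int flags, void *statePtr,
        char *dst, int dstLen,
        int *srcReadPtr, int *dstWrotePtr, int *dstCharsPtr)
{
    int read = 0, wrote = 0, result = L1_OK;

    /* This encoding is stateless, so these arguments are unused. */
    (void) clientData;
    (void) flags;
    (void) statePtr;
    assert(src != NULL && dst != NULL && srcLen >= 0 && dstLen >= 0);

    while (read < srcLen) {
        unsigned char b = (unsigned char) src[read];
        int need = (b < 0x80) ? 1 : 2;

        if (wrote + need > dstLen) {
            result = L1_NOSPACE;
            break;
        }
        if (need == 1) {
            dst[wrote++] = (char) b;
        } else {
            dst[wrote++] = (char) (0xC0 | (b >> 6));
            dst[wrote++] = (char) (0x80 | (b & 0x3F));
        }
        read++;
    }
    if (srcReadPtr != NULL) *srcReadPtr = read;
    if (dstWrotePtr != NULL) *dstWrotePtr = wrote;
    if (dstCharsPtr != NULL) *dstCharsPtr = read; /* one char per byte */
    return result;
}
```

The three output counters are filled in on every return path, so a caller that receives the no-space status can enlarge the buffer and resume from src plus the reported read count.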
ENCODING FILES
Space would prohibit precompiling into Tcl every possible encoding algorithm, so many encodings are stored on disk as dynamically-loadable encoding files. This behavior also allows the user to create additional encoding files that can be loaded using the same mechanism. These encoding files contain information about the tables and/or escape sequences used to map between an external encoding and Unicode. The external encoding may consist of single-byte, multi-byte, or double-byte characters.
Each dynamically-loadable encoding is represented as a text file. The initial line of the file, beginning with a # symbol, is a comment that provides a human-readable description of the file. The next line identifies the type of encoding file. It can be one of the following letters:
- [1] S
-
A single-byte encoding, where one character is always one byte long in the encoding. An example is iso8859-1, used by many European languages.
- [2] D
-
A double-byte encoding, where one character is always two bytes long in the encoding. An example is big5, used for Chinese text.
- [3] M
-
A multi-byte encoding, where one character may be either one or two bytes long. Certain bytes are lead bytes, indicating that another byte must follow and that together the two bytes represent one character. Other bytes are not lead bytes and represent themselves. An example is shiftjis, used by many Japanese computers.
- [4] E
-
An escape-sequence encoding, specifying that certain sequences of bytes do not represent characters, but commands that describe how following bytes should be interpreted.
The rest of the lines in the file depend on the type.
Cases [1], [2], and [3] are collectively referred to as table-based encoding files. The lines in a table-based encoding file are in the same format as this example taken from the shiftjis encoding (this is not the complete file):
# Encoding file: shiftjis, multi-byte
M
003F 0 40
00
0000000100020003000400050006000700080009000A000B000C000D000E000F
0010001100120013001400150016001700180019001A001B001C001D001E001F
0020002100220023002400250026002700280029002A002B002C002D002E002F
0030003100320033003400350036003700380039003A003B003C003D003E003F
0040004100420043004400450046004700480049004A004B004C004D004E004F
0050005100520053005400550056005700580059005A005B005C005D005E005F
0060006100620063006400650066006700680069006A006B006C006D006E006F
0070007100720073007400750076007700780079007A007B007C007D203E007F
0080000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
81
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
25A125A025B325B225BD25BC203B301221922190219121933013000000000000
000000000000000000000000000000002208220B2286228722822283222A2229
000000000000000000000000000000002227222800AC21D221D4220022030000
0000000000000000000000000000000000000000222022A52312220222072261
2252226A226B221A223D221D2235222B222C0000000000000000000000000000
212B2030266F266D266A2020202100B6000000000000000025EF000000000000
The third line of the file is three numbers. The first number is the fallback character (in base 16) to use when converting from UTF-8 to this encoding. The second number is a 1 if this file represents the encoding for a symbol font, or 0 otherwise. The last number (in base 10) is how many pages of data follow.
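As a small illustration of the mixed number bases on that line, the sketch below reads the three fields with sscanf(). The TableHeader struct and parse_table_header() are hypothetical names introduced here; nothing about how Tcl itself reads the line is implied.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical holder for the three numbers on the third line. */
typedef struct {
    unsigned int fallback;  /* fallback character, written in base 16 */
    int symbol;             /* 1 for a symbol font, 0 otherwise */
    int numPages;           /* number of pages that follow, base 10 */
} TableHeader;

/* Returns 1 on success, 0 if the line does not hold three numbers. */
static int parse_table_header(const char *line, TableHeader *hdr)
{
    assert(line != NULL && hdr != NULL);
    return sscanf(line, "%x %d %d",
                  &hdr->fallback, &hdr->symbol, &hdr->numPages) == 3;
}
```

For the shiftjis example, the line "003F 0 40" gives a fallback of 0x3F (the question mark), a symbol flag of 0, and 40 pages.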
Subsequent lines in the example above are pages that describe how to map from the encoding into 2-byte Unicode. The first line in a page identifies the page number. Following it are 256 double-byte numbers, arranged as 16 rows of 16 numbers. Given a character in the encoding, the high byte of that character selects the page, and the low byte is used as an index to select one of the double-byte numbers in that page; the value found there is the corresponding Unicode character. In the example above, for instance, the characters 0x7E and 0x8163 in shiftjis map to 203E and 2026 in Unicode, respectively.
Following the first page will be all the other pages, each in the same format as the first: one number identifying the page followed by 256 double-byte Unicode characters. If a character in the encoding maps to the Unicode character 0000, it means that the character doesn't actually exist. If all characters on a page would map to 0000, that page can be omitted.
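The page lookup can be sketched in a few lines of C. This is an illustration only, not Tcl's loader: ToyTable, parse_page_row(), and toy_lookup() are hypothetical names, and the in-memory layout (one optional 256-entry array per high byte) is an assumption chosen to mirror the file format.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical in-memory form: one optional 256-entry page per high byte. */
typedef struct {
    const unsigned short *pages[256];   /* NULL means the page was omitted */
} ToyTable;

/*
 * Decode one row of a page: 16 numbers of four hex digits each.
 * Returns 1 on success, 0 if the row is too short or not hexadecimal.
 */
static int parse_page_row(const char *hex, unsigned short out[16])
{
    int i, j;

    assert(hex != NULL && out != NULL);
    for (i = 0; i < 16; i++) {
        unsigned int v = 0;
        for (j = 0; j < 4; j++) {
            int c = (unsigned char) hex[i * 4 + j];
            if (!isxdigit(c)) {
                return 0;
            }
            v = v * 16 + (unsigned int)
                    (isdigit(c) ? c - '0' : toupper(c) - 'A' + 10);
        }
        out[i] = (unsigned short) v;
    }
    return 1;
}

/*
 * The high byte selects the page and the low byte indexes into it.
 * A result of 0 means the character does not exist in the encoding,
 * either because the entry is 0000 or because the page was omitted.
 */
static unsigned short toy_lookup(const ToyTable *t, unsigned int ch)
{
    const unsigned short *page = t->pages[(ch >> 8) & 0xFF];
    return (page != NULL) ? page[ch & 0xFF] : 0;
}
```

Loading the seventh row of page 81 from the shiftjis example into entries 0x60 through 0x6F and looking up 0x8163 yields 0x2026.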
Case [4] is the escape-sequence encoding file. The lines in this type of file are in the same format as this example taken from the iso2022-jp encoding:
# Encoding file: iso2022-jp, escape-driven
E
init		{}
final		{}
iso8859-1	\x1b(B
jis0201		\x1b(J
jis0208		\x1b$@
jis0208		\x1b$B
jis0212		\x1b$(D
gb2312		\x1b$A
ksc5601		\x1b$(C
In the file, the first column represents an option and the second column is the associated value. init is a string to emit or expect before the first character is converted, while final is a string to emit or expect after the last character. All other options are names of table-based encodings; the associated value is the escape-sequence that marks that encoding. Tcl syntax is used for the values; in the above example, for instance, {} represents the empty string and \x1b represents character 27.
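The value column can be turned into raw bytes with a very small decoder. The sketch below is a deliberate simplification and not Tcl's parser: decode_escape_value() is a hypothetical helper that understands only the two forms used in the example, {} for the empty string and \x followed by one or two hex digits, and copies every other character unchanged.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of a single hexadecimal digit (caller guarantees isxdigit). */
static int hex_value(int c)
{
    return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

/*
 * Decode a value column into raw bytes. Returns the number of bytes
 * stored in out, or -1 if out is too small. Simplified on purpose:
 * only {} and \xH or \xHH are recognized.
 */
static int decode_escape_value(const char *in, unsigned char *out,
                               size_t outSize)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    if (strcmp(in, "{}") == 0) {
        return 0;                       /* the empty string */
    }
    while (*in != '\0') {
        unsigned int v;

        if (in[0] == '\\' && in[1] == 'x'
                && isxdigit((unsigned char) in[2])) {
            v = (unsigned int) hex_value((unsigned char) in[2]);
            in += 3;
            if (isxdigit((unsigned char) *in)) {
                v = v * 16 + (unsigned int) hex_value((unsigned char) *in);
                in++;
            }
        } else {
            v = (unsigned char) *in++;  /* ordinary character */
        }
        if (n >= outSize) {
            return -1;
        }
        out[n++] = (unsigned char) v;
    }
    return (int) n;
}
```

Applied to the jis0212 entry, the text \x1b$(D decodes to the four bytes 27, '$', '(' and 'D'.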
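When Tcl_GetEncoding() encounters an encoding name that has not been loaded, it attempts to load an encoding file called name.enc from the encoding subdirectory of each directory specified in the library path $tcl_libPath. If the encoding file exists but is malformed, an error message is left in interp.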
PORTABILITY
Windows 8.1. Windows Server 2012 R2. Windows 10. Windows Server 2016. Windows Server 2019. Windows 11. Windows Server 2022.
AVAILABILITY
PTC MKS Toolkit for Professional Developers
PTC MKS Toolkit for Enterprise Developers
PTC MKS Toolkit for Enterprise Developers 64-Bit Edition
PTC MKS Toolkit 10.4 Documentation Build 39.