NAME

Unicode::Transform - conversion among Unicode Transformation Formats

SYNOPSIS

    use Unicode::Transform ':all';

    $unicode_string = utf16be_to_unicode($utf16be_string);
    $utf16le_string = unicode_to_utf16le($unicode_string);
    $utf8_string    = utf32be_to_utf8   ($utf32be_string);

    $utf8_string    = utf32be_to_utf8(\&chr_utf8, $utf32be_string);
         # ill-formed octet sequences are allowed.

DESCRIPTION

This module provides functions that convert a string between several Unicode Transformation Formats (UTFs).

Conversion Between UTF

(Exporting: use Unicode::Transform ':conv';)

<SRC_UTF_NAME>_to_<DST_UTF_NAME>([CALLBACK,] STRING)

Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME.

Function names

A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME. Each of SRC_UTF_NAME and DST_UTF_NAME must be one of the following names (lowercased, with hyphens removed):

    unicode    (for Perl internal Unicode encoding; see perlunicode)
    utf16le    (for UTF-16LE)
    utf16be    (for UTF-16BE)
    utf32le    (for UTF-32LE)
    utf32be    (for UTF-32BE)
    utf8       (for UTF-8)
    utf8mod    (for UTF-8-Mod)
    utfcp1047  (for CP1047-oriented UTF-EBCDIC)

In all, 64 (i.e. 8 times 8) functions are available. Available function names include utf16be_to_utf32le() and utf8_to_unicode(). DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
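The naming scheme can be enumerated mechanically. The following is a minimal sketch in plain Perl that builds the 64 names from the list above; it does not load the module, and the variable names (@utf_names, @conv_names) are only illustrative.

```perl
use strict;
use warnings;

# The eight UTF names listed above, in documentation order.
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Every SRC/DST pair yields one conversion function name,
# including pairs where SRC and DST are the same.
my @conv_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @conv_names, "${src}_to_${dst}";
    }
}

printf "%d conversion functions\n", scalar @conv_names;
```

Such a list can be handy for checking a function name before calling it by reference.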

Conversions where both SRC_UTF_NAME and DST_UTF_NAME begin with 'utf' are well defined and stable. In contrast to these UTFs, the Perl internal Unicode encoding is influenced by platform-dependent features (e.g. 32bit/64bit, ASCII/EBCDIC).

Parameters

If the first parameter is a reference, it is regarded as the CALLBACK. A reference is never accepted as STRING. If CALLBACK is given, the second parameter is STRING; otherwise the first is. Currently, only code references are allowed as CALLBACK.

If CALLBACK is omitted, only Unicode scalar values (0x0000..0xD7FF and 0xE000..0x10FFFF) are allowed. Ill-formed octet sequences (those corresponding to a code point outside the range of Unicode scalar values) and partial octets (those that do not correspond to any code point) are deleted, as if CALLBACK were a code reference that always returns an empty string, sub {''}.

Examples of partial octets: a UTF-8 leading octet without its following octets, like "\xC2"; the last octet of a UTF-16BE or UTF-16LE string with an odd number of octets.

If CALLBACK is specified, the code reference is called each time an ill-formed octet sequence or a partial octet is encountered. The first parameter passed to CALLBACK is the unsigned integer value of the code point; if the value is less than 256, it denotes a partial octet.

The return value from CALLBACK is inserted at that position. You may use chr_<DST_UTF_NAME>() as CALLBACK (see below). The return value from CALLBACK should be encoded in DST_UTF_NAME.

You can call die or croak in CALLBACK to stop the operation when STRING is not entirely well-formed.
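To make the callback rules concrete, here is a hedged sketch of two callbacks for a conversion whose destination is UTF-16BE. The subroutine names replace_with_fffd and strict_callback are hypothetical, and the call into Unicode::Transform is guarded with eval so that the sketch still runs where the module is not installed.

```perl
use strict;
use warnings;

# Hypothetical lenient callback: whatever was ill-formed or partial,
# insert U+FFFD REPLACEMENT CHARACTER, encoded in the destination UTF
# (UTF-16BE here, so the octets are FF FD).
sub replace_with_fffd {
    my ($code) = @_;
    return "\xFF\xFD";
}

# Hypothetical strict callback: abort the whole conversion with die.
# A value below 256 indicates a partial octet, per the rules above.
sub strict_callback {
    my ($code) = @_;
    die sprintf("partial octet 0x%02X\n", $code) if $code < 256;
    die sprintf("ill-formed sequence for code point 0x%X\n", $code);
}

# Guarded use of the module: "A" followed by a lone "\xC2" (a partial octet).
if (eval { require Unicode::Transform; Unicode::Transform->import(':conv'); 1 }) {
    my $lenient = utf8_to_utf16be(\&replace_with_fffd, "A\xC2");
    printf "lenient result: %s\n", unpack('H*', $lenient);

    my $strict = eval { utf8_to_utf16be(\&strict_callback, "A\xC2") };
    print "strict conversion aborted: $@" unless defined $strict;
}
```

The lenient form keeps the output well-formed at the cost of losing the original octets; the strict form suits input that is expected to be clean.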

Conversion from Code Point to String

(Exporting: use Unicode::Transform ':chr';)

chr_<DST_UTF_NAME>(CODEPOINT)

Returns a string in DST_UTF_NAME corresponding to CODEPOINT. CODEPOINT should be an unsigned integer. If CODEPOINT is outside the range of Unicode scalar values, a corresponding ill-formed octet sequence will be returned.

If CODEPOINT is greater than the maximum value, undef is returned. The maximum value of CODEPOINT is:

    0x0010_FFFF for chr_utf16le() and chr_utf16be()
    0x7FFF_FFFF for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
    0xFFFF_FFFF for chr_utf32le(), chr_utf32be()

The maximum value of CODEPOINT for chr_unicode() depends on the platform features (e.g. 32bit/64bit, ASCII/EBCDIC).
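As a worked illustration of the UTF-16 limit, the following plain-Perl sketch mimics the documented behaviour of chr_utf16be(). It is not the module's implementation (the name sketch_chr_utf16be is hypothetical), only an instance of the rules stated above.

```perl
use strict;
use warnings;

sub sketch_chr_utf16be {
    my ($cp) = @_;

    # Above the UTF-16 maximum nothing can be encoded: return undef.
    return undef if $cp > 0x10FFFF;

    # BMP code points take one 16-bit unit. A surrogate code point
    # (0xD800..0xDFFF) passes through here as an ill-formed sequence.
    return pack('n', $cp) if $cp <= 0xFFFF;

    # Supplementary code points take a high and a low surrogate.
    my $v = $cp - 0x10000;
    return pack('n2', 0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF));
}

# Show the octets (in hex) for a few representative code points.
for my $cp (0x41, 0xD800, 0x1F600, 0x110000) {
    my $octets = sketch_chr_utf16be($cp);
    printf "U+%04X => %s\n", $cp, defined $octets ? unpack('H*', $octets) : 'undef';
}
```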

Function names

The full list of functions provided:

chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)

Numeric Value of the First Character

(Exporting: use Unicode::Transform ':ord';)

ord_<SRC_UTF_NAME>(STRING)

Returns the unsigned integer value of the first character of STRING in SRC_UTF_NAME. STRING may begin with an ill-formed octet sequence corresponding to a surrogate code point (0xD800..0xDFFF) or an out-of-range code point (0x110000 and greater). If STRING is empty or begins with a partial octet, returns undef.
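The rules above are simplest to see for UTF-32BE. The following plain-Perl sketch mimics the documented behaviour of ord_utf32be(); it is not the module's implementation, the name sketch_ord_utf32be is hypothetical, and it assumes that a string shorter than four octets counts as beginning with a partial octet.

```perl
use strict;
use warnings;

sub sketch_ord_utf32be {
    my ($octets) = @_;

    # Empty input, or fewer than four octets (assumed here to be the
    # UTF-32 case of "partial octets"): no first character to report.
    return undef if length($octets) < 4;

    # Otherwise the first four octets are one big-endian 32-bit value.
    # Surrogates and values of 0x110000 and greater are returned as they are.
    return unpack('N', $octets);
}

# Report the first code point of a few sample octet strings.
for my $octets ("\x00\x00\x00\x41", "\x00\x00\xD8\x00", "\x00\x11\x00\x00", "\x00\x00", "") {
    my $cp = sketch_ord_utf32be($octets);
    printf "%-8s => %s\n", unpack('H*', $octets), defined $cp ? sprintf('0x%X', $cp) : 'undef';
}
```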

Function names

The full list of functions provided:

ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.