Unicode::Transform - conversion among Unicode Transformation Formats
use Unicode::Transform ':all';
$unicode_string = utf16be_to_unicode($utf16be_string);
$utf16le_string = unicode_to_utf16le($unicode_string);
$utf8_string = utf32be_to_utf8($utf32be_string);
$utf8_string = utf32be_to_utf8(\&chr_utf8, $utf32be_string); # ill-formed octet sequences are allowed.
This module provides functions that convert a string between Unicode Transformation Formats (UTF).
(Exporting: use Unicode::Transform ':conv';)
<SRC_UTF_NAME>_to_<DST_UTF_NAME>([CALLBACK,] STRING)
Returns a string in DST_UTF_NAME corresponding to STRING in SRC_UTF_NAME.
Function names
A function name consists of SRC_UTF_NAME, the string '_to_', and DST_UTF_NAME. SRC_UTF_NAME and DST_UTF_NAME must each be one of the following names (lowercased, with hyphens removed):
unicode    (for Perl internal Unicode encoding; see perlunicode)
utf16le    (for UTF-16LE)
utf16be    (for UTF-16BE)
utf32le    (for UTF-32LE)
utf32be    (for UTF-32BE)
utf8       (for UTF-8)
utf8mod    (for UTF-8-Mod)
utfcp1047  (for CP1047-oriented UTF-EBCDIC)
In all, 64 (i.e. 8 times 8) functions are available. Available function names include utf16be_to_utf32le() and utf8_to_unicode().
DST_UTF_NAME may be the same as SRC_UTF_NAME, as in utf8_to_utf8().
Conversions where both SRC_UTF_NAME and DST_UTF_NAME begin with 'utf' are well defined and stable. In contrast to these UTFs, the Perl internal Unicode encoding is influenced by platform-dependent features (e.g. 32bit/64bit, ASCII/EBCDIC).
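The naming rule above can be checked mechanically. The following sketch uses only core Perl and does not load Unicode::Transform; @utf_names and @conv_names are illustrative names, not part of the module.

```perl
use strict;
use warnings;

# The eight UTF names listed above (lowercased, hyphens removed).
my @utf_names = qw(unicode utf16le utf16be utf32le utf32be utf8 utf8mod utfcp1047);

# Build every <SRC_UTF_NAME>_to_<DST_UTF_NAME> combination,
# including same-format pairs such as utf8_to_utf8.
my @conv_names;
for my $src (@utf_names) {
    for my $dst (@utf_names) {
        push @conv_names, "${src}_to_${dst}";
    }
}

printf "%d conversion functions\n", scalar @conv_names;   # 8 * 8 = 64
```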
Parameters
If the first parameter is a reference, it is regarded as the CALLBACK. A reference is never allowed as STRING. If CALLBACK is given, the second parameter is STRING; otherwise the first is. Currently, only code references are allowed as CALLBACK.
If CALLBACK is omitted, only Unicode scalar values (0x0000..0xD7FF and 0xE000..0x10FFFF) are allowed. Ill-formed octet sequences (corresponding to a code point outside the range of Unicode scalar values) and partial octets (which do not correspond to any code point) are deleted, as if a code reference that always returns an empty string, sub {''}, were used as CALLBACK.
Examples of partial octets: a first octet without the following octets in UTF-8, like "\xC2"; the last octet of a UTF-16BE or UTF-16LE string with an odd number of octets.
If CALLBACK is specified, each occurrence of an ill-formed octet sequence or a partial octet calls the code reference. The first parameter for CALLBACK is the unsigned integer value of the code point; if the value is less than 256, it is a partial octet.
The return value from CALLBACK will be inserted there.
You may use chr_<DST_UTF_NAME>() as CALLBACK (see below).
The return value from CALLBACK should be encoded in DST_UTF_NAME.
You can call die or croak in CALLBACK to stop the operation when STRING as a whole is not well-formed.
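As an illustration of the CALLBACK rules above, the sketch below converts a UTF-8 octet string ending in the partial octet "\xC2" to UTF-16BE in three ways: with the default behavior, with a callback that substitutes U+FFFD, and with a callback that dies. It assumes Unicode::Transform is installed; the variable names and the choice of U+FFFD as a replacement are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Transform ':all';

# "A" followed by a lone lead octet: "\xC2" is a partial octet in UTF-8.
my $utf8 = "A\xC2";

# No CALLBACK: the partial octet is deleted, as if sub {''} were given.
my $dropped = utf8_to_utf16be($utf8);

# CALLBACK returning a replacement. The return value must already be in
# the destination format, hence chr_utf16be() rather than a Perl character.
my $replaced = utf8_to_utf16be(sub { chr_utf16be(0xFFFD) }, $utf8);

# CALLBACK that aborts. The argument is the code point value; a value
# less than 256 indicates a partial octet.
my $strict = sub {
    my ($value) = @_;
    die sprintf("%s 0x%X\n",
        $value < 256 ? "partial octet" : "ill-formed code point", $value);
};
my $ok = eval { utf8_to_utf16be($strict, $utf8); 1 };
print "conversion stopped: $@" unless $ok;
```

Dying inside the callback suits input validation, while returning a replacement suits lossy but uninterrupted conversion.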
(Exporting: use Unicode::Transform ':chr';)
chr_<DST_UTF_NAME>(CODEPOINT)
Returns a string in DST_UTF_NAME corresponding to CODEPOINT. CODEPOINT should be an unsigned integer. If CODEPOINT is outside the range of Unicode scalar values, a corresponding ill-formed octet sequence will be returned.
If CODEPOINT is greater than the maximum value, returns undef.
The maximum value of CODEPOINT is:
0x0010_FFFF  for chr_utf16le() and chr_utf16be()
0x7FFF_FFFF  for chr_utf8(), chr_utf8mod(), chr_utfcp1047()
0xFFFF_FFFF  for chr_utf32le(), chr_utf32be()
The maximum value of CODEPOINT for chr_unicode() depends on the platform features (e.g. 32bit/64bit, ASCII/EBCDIC).
Function names
The full list of functions provided:
chr_unicode(CODEPOINT)
chr_utf16le(CODEPOINT)
chr_utf16be(CODEPOINT)
chr_utf32le(CODEPOINT)
chr_utf32be(CODEPOINT)
chr_utf8(CODEPOINT)
chr_utf8mod(CODEPOINT)
chr_utfcp1047(CODEPOINT)
(Exporting: use Unicode::Transform ':ord';)
ord_<SRC_UTF_NAME>(STRING)
Returns the unsigned integer value of the first character of STRING in SRC_UTF_NAME. STRING may begin with an ill-formed octet sequence corresponding to a surrogate code point (0xD800..0xDFFF) or an out-of-range code point (0x110000 and greater). If STRING is empty or begins with a partial octet, returns undef.
Function names
The full list of functions provided:
ord_unicode(STRING)
ord_utf16le(STRING)
ord_utf16be(STRING)
ord_utf32le(STRING)
ord_utf32be(STRING)
ord_utf8(STRING)
ord_utf8mod(STRING)
ord_utfcp1047(STRING)
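The chr_ and ord_ families can be exercised together. The following is a minimal sketch, assuming Unicode::Transform is installed; the comments restate the rules described above and are not recorded output, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Unicode::Transform ':all';

# Encode U+1F600 in two formats; each result is an octet string.
my $utf8    = chr_utf8(0x1F600);
my $utf16be = chr_utf16be(0x1F600);

# ord_* reads the first character back as an unsigned integer.
my $cp_from_utf8  = ord_utf8($utf8);
my $cp_from_utf16 = ord_utf16be($utf16be);

# Above the documented maximum for UTF-16 (0x0010_FFFF): undef.
my $too_large = chr_utf16be(0x110000);

# Empty string, or a string beginning with a partial octet: undef.
my $from_empty   = ord_utf8("");
my $from_partial = ord_utf16be("\x00");

printf "U+%04X U+%04X\n", $cp_from_utf8, $cp_from_utf16;
```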
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2002-2005, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.