Unicode::String - String of Unicode characters
Unicode::String - String of Unicode characters (UTF-16BE)
use Unicode::String qw(utf8 latin1 utf16be);
$u = utf8("string");
$u = latin1("string");
$u = utf16be("\0s\0t\0r\0i\0n\0g");
print $u->utf32be;   # 4 byte characters
print $u->utf16le;   # 2 byte characters + surrogates
print $u->utf8;      # 1-4 byte characters
A Unicode::String object represents a sequence of Unicode characters. Methods are provided to convert between various external formats (encodings) and Unicode::String objects, and methods are provided for common string manipulations.
The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(), utf7(), latin1(), uhex(), uchr() can be imported from the Unicode::String module and work as constructors, initializing strings from data in the corresponding encoding.
Unicode::String objects overload various operators, which means that in most cases they can be treated like plain strings.
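As a rough illustration of the overloading, the sketch below treats an object like a plain string. It assumes the default stringify_as() encoding of ``utf8'' is in effect and uses ASCII-only input so that the stringified result is unambiguous.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("abc");

# Interpolating the object on its own triggers the overloaded stringification.
my $plain = "$us";

# The x operator is overloaded and yields another Unicode::String object.
my $twice = $us x 2;

print "$plain\n";           # abc
print ref($twice), "\n";    # Unicode::String
print $twice->latin1, "\n"; # abcabc
```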
Internally a Unicode::String object is represented by a string of 2 byte numbers in network byte order (big-endian). This representation is not exposed through the provided API, but it might be useful to know in order to predict the efficiency of the provided methods.
The following class methods are available:
The stringify_as() class method specifies which encoding is used when Unicode::String objects are implicitly converted to and from plain strings.
If an argument is provided, it sets the current encoding. The argument should be one of the following: ``ucs4'', ``utf32'', ``utf32be'', ``utf32le'', ``ucs2'', ``utf16'', ``utf16be'', ``utf16le'', ``utf8'', ``utf7'', ``latin1'' or ``hex''. The default is ``utf8''.
The stringify_as() method returns a reference to the current encoding function.
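A minimal sketch of changing the implicit encoding follows. It assumes the setting is global to the process, which is why the example restores ``utf8'' afterwards; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("\xB5m");    # the characters U+00B5 U+006D, given as Latin-1 bytes

# Switch implicit stringification from the default "utf8" to "latin1".
Unicode::String->stringify_as("latin1");
my $as_latin1 = "$us";

# Restore the default so that other code is not affected.
Unicode::String->stringify_as("utf8");

print unpack("H*", $as_latin1), "\n";    # b56d
```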
The new() class method creates a new Unicode::String object. If an $initial_value argument is given, it is decoded according to the specified stringify_as() encoding, UTF-8 by default.
In general it is recommended to import and use one of the encoding-specific constructor functions instead of invoking this method.
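The difference between the two styles can be sketched as follows, assuming stringify_as() has not been changed from its ``utf8'' default. The encoding-specific constructor does not depend on that global setting, which is one reason to prefer it.

```perl
use strict;
use warnings;
use Unicode::String qw(utf8);

# new() decodes its argument with the current stringify_as() encoding.
my $via_new = Unicode::String->new("abc");

# The imported constructor names the encoding explicitly.
my $via_utf8 = utf8("abc");

print $via_new->utf16be eq $via_utf8->utf16be ? "same\n" : "different\n";    # same
```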
These methods get or set the value of the Unicode::String object by passing strings in the corresponding encoding. If a new value is passed as an argument, it sets the value of the Unicode::String, and the previous value is returned. If no argument is passed, the current value is returned.
To illustrate the encodings, we show how the 2-character sample string ``µm'' (micrometer) is encoded in each one.
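The byte layouts can also be inspected directly. The sketch below builds the sample string from its Latin-1 bytes and dumps several of the encodings as hex; it relies on each encoding name being callable as a getter method, as described above.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("\xB5m");    # U+00B5 U+006D

# Call each getter by name and show the resulting bytes as hex.
for my $enc (qw(latin1 utf16be utf16le utf32be utf32le)) {
    printf "%-8s %s\n", $enc, unpack("H*", $us->$enc);
}
# latin1   b56d
# utf16be  00b5006d
# utf16le  b5006d00
# utf32be  000000b50000006d
# utf32le  b50000006d000000
```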
Alternative names for the utf32be() method are utf32() and ucs4().
Alternative names for the utf16be() method are utf16() and ucs2().
If the string passed to utf16be() starts with the Unicode byte order mark in little endian order, the result is as if utf16le() was called instead.
If the string passed to utf16le() starts with the Unicode byte order mark in big endian order, the result is as if utf16be() was called instead.
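A small sketch of the byte order mark handling: the same little-endian data is passed to both constructors, and the two objects are expected to hold the same characters. Whether the mark itself is kept as the first character is not stated here, so the example only compares the two results.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be utf16le);

# "hi" in UTF-16LE, preceded by a little-endian byte order mark (FF FE).
my $le_data = "\xFF\xFE" . "h\0i\0";

my $via_be = utf16be($le_data);    # the mark says little-endian, so it is honoured
my $via_le = utf16le($le_data);

print $via_be->utf16be eq $via_le->utf16be ? "same\n" : "different\n";    # same
```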
The UTF-7 encoding uses only plain US-ASCII characters. This makes it safe for transport through protocols that strip the eighth bit. Characters outside the US-ASCII range are base64-encoded and '+' is used as an escape character. The UTF-7 encoding is described in RFC 1642.
If the (global) variable $Unicode::String::UTF7_OPTIONAL_DIRECT_CHARS is TRUE, then a wider range of characters is encoded as themselves. It is TRUE by default. The characters affected by this are:
! " # $ % & * ; < = > @ [ ] ^ _ ` { | }
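As a sketch of the escaping, the sample string ``µm'' can be round-tripped through UTF-7. The character U+00B5 is the two bytes 00 B5 in UTF-16BE, which base64-encode to ``ALU''; the trailing '-' closes the escaped run before the plain ``m''.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1 utf7);

my $us      = latin1("\xB5m");
my $encoded = $us->utf7;

print "$encoded\n";    # +ALU-m

# Decoding the UTF-7 text should give back the original characters.
my $back = utf7($encoded);
print $back->latin1 eq "\xB5m" ? "round trip ok\n" : "mismatch\n";    # round trip ok
```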
Characters outside the ``\x00'' .. ``\xFF'' range are simply removed from the return value of the latin1() method. If you want more control over the mapping from Unicode to ISO-8859-1, use the Unicode::Map8 class. This is also the way to deal with other 8-bit character sets.
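The lossy behaviour of latin1() can be sketched with a string that contains one character outside the Latin-1 range. Note that the character is dropped rather than replaced, so the result is shorter than the original.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# U+0061 'a', U+263A (outside Latin-1), U+0062 'b'
my $us = utf16be(pack("n*", 0x0061, 0x263A, 0x0062));

my $narrow = $us->latin1;
print "$narrow\n";    # ab
```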
The following methods are available:
Converts a Unicode::String to a plain string according to the setting of stringify_as(). The default stringify_as() encoding is ``utf8''.
Converts a Unicode::String to a number. Currently only the digits in the range 0x30 .. 0x39 are recognized. The plan is to eventually support all Unicode digit characters.
Converts a Unicode::String to a boolean value. Only the empty string is FALSE. A string consisting of only the character U+0030 is considered TRUE, even though Perl considers ``0'' to be FALSE.
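The difference from Perl's own truth rules can be sketched as follows. The example relies on the overloaded boolean conversion rather than calling a method by name.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $zero  = latin1("0");
my $empty = latin1("");

# Evaluate both objects in boolean context.
my $zero_is_true  = $zero  ? 1 : 0;
my $empty_is_true = $empty ? 1 : 0;

print "object '0' is true\n"    if $zero_is_true;
print "plain '0' is false\n"    unless "0";
print "empty object is false\n" unless $empty_is_true;
```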
Returns a new Unicode::String where the content of $us is repeated $count times. This operation is also overloaded as:
$us x $count
Concatenates $us and $other_string. If $other_string is not a Unicode::String object, then it is first passed to the Unicode::String->new constructor function. This operation is also overloaded as:
$us . $other_string
Appends $other_string to $us. If $other_string is not a Unicode::String object, then it is first passed to the Unicode::String->new constructor function. This operation is also overloaded as:
$us .= $other_string
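A short sketch of both operators with a plain string on the right-hand side. Since the plain string goes through Unicode::String->new, it is decoded with the current stringify_as() encoding; the example assumes the ``utf8'' default and ASCII input.

```perl
use strict;
use warnings;
use Unicode::String qw(latin1);

my $us = latin1("foo");

# Concatenation returns a new object and leaves $us unchanged.
my $joined = $us . "bar";

# Appending modifies $us in place.
$us .= "baz";

print ref($joined), "\n";       # Unicode::String
print $joined->latin1, "\n";    # foobar
print $us->latin1, "\n";        # foobaz
```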
Returns a copy of the Unicode::String object. This operation is overloaded as the assignment operator.
Returns the length of the Unicode::String. Surrogate pairs are still counted as 2.
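To make the surrogate rule concrete, the sketch below builds a string from one surrogate pair and one ordinary character. The method name length() is taken from the module's interface and does not appear in this text, so treat it as an assumption.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# U+1D11E as the surrogate pair D834 DD1E, followed by U+0061 'a'.
my $us = utf16be(pack("n*", 0xD834, 0xDD1E, 0x0061));

# Two characters for a human reader, but three 16-bit units.
my $len = $us->length;
print "$len\n";    # 3
```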
Swaps the bytes in the internal representation of the Unicode::String object.
Unicode reserves the character U+FEFF as a byte order mark. This works because the swapped character, U+FFFE, is reserved and guaranteed not to be a valid character. For strings that have the byte order mark as the first character, we can guarantee to get the byte order right with the following code:
$ustr->byteswap if $ustr->ord == 0xFFFE;
The ord() method deals with surrogate pairs, which gives us a result range of 0x0 .. 0x10FFFF. If the $us string is empty, undef is returned.
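Both cases can be sketched in a few lines. A surrogate pair is combined as (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000, so the pair D834 DD1E yields 0x1D11E. The calls use scalar context explicitly.

```perl
use strict;
use warnings;
use Unicode::String qw(utf16be);

# A single character outside the 16-bit range, stored as a surrogate pair.
my $clef = utf16be(pack("n*", 0xD834, 0xDD1E));
my $code = scalar $clef->ord;
printf "U+%X\n", $code;    # U+1D11E

# An object without any content has no first character.
my $empty      = Unicode::String->new;
my $empty_code = scalar $empty->ord;
print defined($empty_code) ? "defined\n" : "undef\n";    # undef
```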
Works similarly to perl's builtin substr() function.
Unicode::String
object).
The following functions are provided. None of these are exported by default.
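The functions utf32be(), utf32le(), utf16be(), utf16le(), utf8(), utf7() and latin1() are constructors for Unicode::String objects. The provided argument should be encoded correspondingly.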
The uhex() function makes a Unicode::String object from a string of hex values. See the hex() method for a description of the format.
The uchr() function makes a Unicode::String object from a Unicode character code. This works similarly to perl's builtin chr() function.
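A brief sketch of the two functions follows. The ``U+XXXX'' notation passed to uhex(), with entries separated by a single space, is an assumption about the hex() format, which is not reproduced in this text.

```perl
use strict;
use warnings;
use Unicode::String qw(uchr uhex);

# One character from its code point.
my $smiley = uchr(0x263A);

# Two characters from hex notation (assumed format: "U+XXXX U+XXXX").
my $micro_m = uhex("U+00b5 U+006d");

print unpack("H*", $smiley->utf16be), "\n";    # 263a
print unpack("H*", $micro_m->latin1), "\n";    # b56d
```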
See also the Unicode::CharName manpage and the Unicode::Map8 manpage.
Copyright 1997-2000,2005 Gisle Aas.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.