Unicode::Map8 - Mapping table between 8-bit chars and Unicode
    require Unicode::Map8;
    my $no_map = Unicode::Map8->new("ISO646-NO") || die;
    my $l1_map = Unicode::Map8->new("latin1")    || die;

    my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
    my $lstr = $l1_map->to8($ustr);
    print $lstr;

    print $no_map->tou("V}re norske tegn b|r {res\n")->utf8
The Unicode::Map8 class implements efficient mapping tables between 8-bit character sets and 16-bit character sets like Unicode. The tables are efficient both in terms of space allocated and translation speed. The 16-bit strings are assumed to use network byte order.
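To make the byte-order remark concrete, the sketch below builds a 16-bit string by hand with Perl's core pack and unpack functions, without using Unicode::Map8 itself. It assumes the Latin-1 case, where each 8-bit code equals its 16-bit code. The helper names latin1_to_ucs2_be and ucs2_be_to_codes are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: encode a Latin-1 byte string as a 16-bit string in
# network (big-endian) byte order. This shortcut is valid only because
# Latin-1 codes equal the first 256 Unicode code points.
sub latin1_to_ucs2_be {
    my ($bytes) = @_;
    return pack("n*", unpack("C*", $bytes));
}

# Hypothetical helper: split a 16-bit big-endian string into code values.
sub ucs2_be_to_codes {
    my ($ustr) = @_;
    return unpack("n*", $ustr);
}

my $ustr = latin1_to_ucs2_be("Hi");

# Dump the raw bytes; the high-order byte of each 16-bit unit comes first.
print join(" ", map { sprintf "%02X", $_ } unpack("C*", $ustr)), "\n";
# prints "00 48 00 69"
```

For other character sets the two codes generally differ, which is what the mapping tables are for.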
The following methods are available:
If you omit the argument, then an empty mapping table is constructed. You must then add mapping pairs to it using the addpair() method described below.
Consider the following example:
    $m->addpair(0x20, 0x0020);
    $m->addpair(0x20, 0x00A0);
    $m->addpair(0xA0, 0x00A0);
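It means that the characters 0x20 and 0xA0 in the 8-bit charset map to themselves in the 16-bit set, but 0x00A0 in the 16-bit character set maps to 0x20. This is because, in each direction, the first mapping defined for a code is the one that takes effect.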
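The behaviour in this example can be modelled with two plain hashes, one per direction. This is a minimal sketch only: it is not how Unicode::Map8 stores its tables, and model_addpair, model_to16 and model_to8 are hypothetical names standing in for the real methods.

```perl
use strict;
use warnings;

# One lookup table per direction. This only models the "first definition
# wins" behaviour; it says nothing about the module's internal layout.
my (%to16, %to8);

# Hypothetical stand-in for addpair(): record a pair, but keep any earlier
# definition that already exists for either code.
sub model_addpair {
    my ($u8, $u16) = @_;
    $to16{$u8} = $u16 unless exists $to16{$u8};
    $to8{$u16} = $u8  unless exists $to8{$u16};
}

sub model_to16 { return $to16{ $_[0] } }
sub model_to8  { return $to8{ $_[0] } }

model_addpair(0x20, 0x0020);
model_addpair(0x20, 0x00A0);   # 0x20 is already mapped; only 0x00A0 -> 0x20 is new
model_addpair(0xA0, 0x00A0);   # 0x00A0 is already mapped; only 0xA0 -> 0x00A0 is new

printf "8-bit 0x%02X -> 16-bit 0x%04X\n", $_, model_to16($_) for 0x20, 0xA0;
printf "16-bit 0x%04X -> 8-bit 0x%02X\n", $_, model_to8($_)  for 0x0020, 0x00A0;
```

Both 0x0020 and 0x00A0 end up mapping back to 0x20, so a round trip through the 16-bit set does not preserve 0xA0.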
to8() and recode8().
tou() and recode8().
Same as to16(), but returns a Unicode::String object instead of a plain UCS2 string.
The following callback methods are available. You can override these methods by creating a subclass of Unicode::Map8.
Example:
    package MyMapper;
    @ISA=qw(Unicode::Map8);

    sub unmapped_to8
    {
        my($self, $code) = @_;
        require Unicode::CharName;
        "<" . Unicode::CharName::uname($code) . ">";
    }
The Unicode::Map8 constructor can parse two different file formats: a binary format and a textual format.
The binary format is simple. It consists of a sequence of 16-bit integer pairs in network byte order. The first pair should contain the magic value 0xFFFE, 0x0001. Of each following pair, the first value is the code of an 8-bit character and the second is the code of the 16-bit character. It follows from this that the first value should be less than 256.
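As an illustration of this layout, the following sketch writes and reads such a byte string using only Perl's core pack and unpack. It is based solely on the description above rather than on the module's own reader, and the helper names pairs_to_bin and bin_to_pairs are hypothetical. Dying on a bad magic value or an out-of-range 8-bit code is an assumption of the sketch.

```perl
use strict;
use warnings;

# Hypothetical helper: serialise [8-bit code, 16-bit code] pairs into the
# binary layout described above, starting with the magic pair.
sub pairs_to_bin {
    my (@pairs) = @_;
    return pack("n*", 0xFFFE, 0x0001, map { @$_ } @pairs);
}

# Hypothetical helper: decode such a string back into pairs, checking the
# magic pair and the "first value below 256" rule.
sub bin_to_pairs {
    my ($data) = @_;
    my @words = unpack("n*", $data);
    die "bad magic\n"
        unless @words >= 2 && $words[0] == 0xFFFE && $words[1] == 0x0001;
    splice(@words, 0, 2);                      # drop the magic pair
    die "odd number of 16-bit values\n" if @words % 2;
    my @pairs;
    while (my ($u8, $u16) = splice(@words, 0, 2)) {
        die sprintf("8-bit code 0x%04X out of range\n", $u8) if $u8 > 0xFF;
        push @pairs, [$u8, $u16];
    }
    return @pairs;
}

my $bin = pairs_to_bin([0x41, 0x0041], [0xE5, 0x00E5]);
printf "%d bytes\n", length $bin;              # magic pair plus two pairs, 4 bytes each
printf "0x%02X => 0x%04X\n", @$_ for bin_to_pairs($bin);
```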
The textual format consists of lines that are either a comment (first non-blank character is '#'), a completely blank line, or a line with two hexadecimal numbers. The hexadecimal numbers must be preceded by "0x" as in C and Perl. This is the same format used by the Unicode mapping files available from <URL:ftp://ftp.unicode.org/Public>.
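A line-by-line reader for the textual format might look like the sketch below. It follows only the three line kinds described above. The helper name parse_map_line is hypothetical, and both the tolerance for trailing text after the two numbers and the decision to die on any other line are assumptions of the sketch, not documented behaviour of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line of the textual format. Returns an
# (8-bit code, 16-bit code) list, or an empty list for comments and blank
# lines.
sub parse_map_line {
    my ($line) = @_;
    return if $line =~ /^\s*(?:#|$)/;          # comment or blank line
    if ($line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/) {
        return (hex($1), hex($2));
    }
    die "unrecognised line: $line\n";          # assumption: be strict
}

my @lines = (
    "# sample mapping\n",
    "\n",
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n",
);

for my $line (@lines) {
    my @pair = parse_map_line($line) or next;  # skip comments and blanks
    printf "0x%02X => 0x%04X\n", @pair;
}
```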
The mapping table files are installed in the Unicode/Map8/maps directory somewhere in the Perl @INC path. The variable $Unicode::Map8::MAPS_DIR is the complete path name to this directory. Binary mapping files are stored within this directory with the suffix .bin. Textual mapping files are stored with the suffix .txt.
The scripts map8_bin2txt and map8_txt2bin can translate between these mapping file formats.
A special file called aliases within $MAPS_DIR specifies all the alias names that can be used to denote the various character sets. The first name on each line is the real file name and the rest are alias names separated by spaces.
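Under that description, one line of the aliases file can be turned into an alias lookup as sketched below. The helper name parse_alias_line and the sample line are made up for illustration; whether the real file allows comments, or whether names are compared case-insensitively, is not stated here and is not modelled.

```perl
use strict;
use warnings;

# Hypothetical helper: map every alias on a line to the real file name,
# which is the first whitespace-separated word on that line.
sub parse_alias_line {
    my ($line) = @_;
    my ($real, @aliases) = split ' ', $line;
    return unless defined $real;               # empty line: nothing to add
    return map { $_ => $real } @aliases;
}

# Made-up sample line, not copied from the real aliases file.
my %alias = parse_alias_line("ISO_8859-1 latin1 l1\n");
print "$_ => $alias{$_}\n" for sort keys %alias;
```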
The `umap --list' command can be used to list the supported character sets.
Does not handle Unicode surrogate pairs as a single character.
umap(1), the Unicode::String manpage
Copyright 1998 Gisle Aas.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.