String::Multibyte - manipulation of multibyte character strings
use String::Multibyte;
$utf8 = String::Multibyte->new('UTF8');
$utf8_len = $utf8->length($utf8_str);
This module provides methods that emulate the corresponding CORE functions for locale-independent manipulation of multibyte character strings.
Why is this module locale-independent? Because it considers only the byte-sequence structure of each charset and is not aware of any locale settings. Locale-dependent methods such as uc() and lc() are not supported at all.
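As a minimal sketch of what "byte-sequence structure" means in practice, the following compares CORE::length with the length method for the 'UTF8' charset. The sample string (two hiragana letters written as raw UTF-8 bytes) and the variable names are illustrative assumptions.

```perl
use strict;
use warnings;
use String::Multibyte;

# HIRAGANA LETTER A + HIRAGANA LETTER I as UTF-8 bytes:
# two characters, six bytes.
my $str = "\xE3\x81\x82\xE3\x81\x84";

my $utf8 = String::Multibyte->new('UTF8');

my $byte_len = CORE::length($str);    # counts bytes
my $char_len = $utf8->length($str);   # counts characters of the charset

print "bytes: $byte_len, characters: $char_len\n";
```

No locale is consulted at any point; the result follows only from the byte patterns that the charset definition accepts.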
The definition files are located under the directory where String::Multibyte is installed. E.g. if String::Multibyte is installed as perl/site/lib/String/Multibyte.pm, copy the definition file for String::Multibyte::Foo to perl/site/lib/String/Multibyte/Foo.pm.
The definition file must return a hashref with the following key(s).
charset
'charset' is a string giving the charset name. In almost all cases, omitting 'charset' matters very little, but keep the name from conflicting with that of another charset.
regexp
'regexp' (REQUIRED) is a regular expression that matches a single character of the charset in question. (You may use qr// if available.) If 'regexp' is omitted, any method call croaks.
nextchar
'nextchar' must be a coderef that returns the character following the specified character. If the 'nextchar' coderef is omitted, the mkrange() and strtr() methods do not treat the hyphen as a metacharacter for character ranges.
cmpchar
'cmpchar' must be a coderef that compares the two specified characters. If the 'cmpchar' coderef is omitted, the mkrange() and strtr() methods do not understand reverse character ranges.
hyphen
'hyphen' is the character that stands for a character range. The default is '-'.
escape
'escape' is an escape character for a hyphen character. The default is '\\'.
The 'escape' character is valid only before a hyphen or another 'escape' (e.g. '\\\\-]' means '\\' to ']'; '\\\\\-]' means '\\', '-', and ']').
If an 'escape' character is followed by any character other than 'escape' or 'hyphen', it is parsed literally.
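The keys above can be put together as in the following sketch, which defines a hypothetical 7-bit charset inline. The charset name, the exact calling convention of the two coderefs (one character in and its successor or undef out for 'nextchar'; two characters in and a cmp-style result out for 'cmpchar') are assumptions made for illustration.

```perl
use strict;
use warnings;
use String::Multibyte;

# Hypothetical 7-bit charset, passed as a hashref instead of a .pm file.
my $ascii = String::Multibyte->new({
    charset  => 'my-ascii',        # illustrative name
    regexp   => '[\x00-\x7F]',     # one character is one 7-bit byte
    # Assumption: receives one character, returns the next one
    # (undef when there is no next character).
    nextchar => sub {
        my $ch = shift;
        return ord($ch) < 0x7F ? chr(ord($ch) + 1) : undef;
    },
    # Assumption: receives two characters, returns a cmp-style value.
    cmpchar  => sub { $_[0] cmp $_[1] },
});

my $range = $ascii->mkrange('a-e');   # scalar context: one string
my @chars = $ascii->mkrange('a-e');   # list context: one element per character

print "$range\n";
print scalar(@chars), " characters\n";
```

'hyphen' and 'escape' are left out here, so their defaults ('-' and '\\') apply.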
$mbcs = String::Multibyte->new(CHARSET)
$mbcs = String::Multibyte->new(CHARSET, VERBOSE)
CHARSET is the charset name; strictly speaking, the file name of the definition file (without the suffix .pm). The constructor returns an instance that tells the methods in which charset the specified strings should be handled.
CHARSET may be a hashref; this is how to define a charset without any .pm file.
# see perlfaq6 :-)
my $martian = String::Multibyte->new({
    charset => "martian",
    regexp => '[A-Z][A-Z]|[^A-Z]',
});
If a true value is specified as VERBOSE, the called method (except islegal) checks its arguments and carps if any of them is not legally encoded. Otherwise no such check is carried out (this saves a bit of time but is unsafe, though you can use the islegal method if necessary).
$mbcs->islegal(LIST)
Returns a boolean indicating whether all the strings in LIST are legally encoded in the charset.
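A short sketch of both safety nets, islegal and the VERBOSE flag, for the 'UTF8' charset. The sample strings are assumptions; the warning is captured with a local __WARN__ handler only so that the example can show it was raised.

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

my $good = "caf\xC3\xA9";   # "cafe" with e-acute, as UTF-8 bytes
my $bad  = "caf\xE9";       # Latin-1 byte: not legal UTF-8

my $good_ok = $utf8->islegal($good);          # true
my $both_ok = $utf8->islegal($good, $bad);    # false: one string is illegal

# With VERBOSE set, other methods check their arguments and carp.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $checked = String::Multibyte->new('UTF8', 1);
    $checked->length($bad);
}

print "good: ", ($good_ok ? "legal" : "illegal"), "\n";
print "good+bad: ", ($both_ok ? "legal" : "illegal"), "\n";
print "warnings raised: ", scalar(@warnings), "\n";
```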
$mbcs->length(STRING)
Returns the length of STRING in characters.
$mbcs->strrev(STRING)
Returns STRING with its characters in reverse order.
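For instance, with the "martian" charset from the constructor example (two capitals form one character, anything else is a single character), length and strrev operate on those units. The sample string is an assumption.

```perl
use strict;
use warnings;
use String::Multibyte;

my $martian = String::Multibyte->new({
    charset => 'martian',
    regexp  => '[A-Z][A-Z]|[^A-Z]',
});

my $str = 'ABcdEF';                   # characters: "AB", "c", "d", "EF"

my $len = $martian->length($str);     # 4 characters, not 6 bytes
my $rev = $martian->strrev($str);     # "EFdcAB": pairs stay intact

print "$len $rev\n";
```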
$mbcs->index(STRING, SUBSTR)
$mbcs->index(STRING, SUBSTR, POSITION)
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string. If the substring is not found, returns -1.
$mbcs->rindex(STRING, SUBSTR)
$mbcs->rindex(STRING, SUBSTR, POSITION)
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before that position. If the substring is not found, returns -1.
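The positions returned by index and rindex are counted in characters, as this sketch with the 'UTF8' charset illustrates. The sample string (hiragana A, I, A as raw UTF-8 bytes) and the variable names are assumptions.

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

my $ka  = "\xE3\x81\x82";    # HIRAGANA LETTER A (3 bytes)
my $ki  = "\xE3\x81\x84";    # HIRAGANA LETTER I (3 bytes)
my $str = $ka . $ki . $ka;   # 3 characters, 9 bytes

my $first  = $utf8->index($str, $ka);       # 0
my $second = $utf8->index($str, $ka, 1);    # 2: search starts at character 1
my $last   = $utf8->rindex($str, $ka);      # 2
my $before = $utf8->rindex($str, $ka, 1);   # 0: at or before character 1
my $none   = $utf8->index($str, 'x');       # -1: not found

my $byte_pos = CORE::index($str, $ka, 1);   # 6: a byte offset, for contrast

print "$first $second $last $before $none $byte_pos\n";
```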
$mbcs->strspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character not contained in the search list.
$mbcs->strspn("+0.12345*12", "+-.0123456789"); # returns 8.
If the specified string does not contain any character in the search list, returns 0.
If the string consists only of characters in the search list, the returned value equals the length of the string.
SEARCHLIST can be an ARRAYREF. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also OK, since the character order of SEARCHLIST doesn't matter in strspn).
$mbcs->strcspn(STRING, SEARCHLIST)
Returns the position of the first occurrence of any character contained in the search list.
If the specified string does not contain any character in the search list, the returned value equals the length of the string.
SEARCHLIST can be an ARRAYREF. E.g. if a charset treats CRLF as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" can be given as ["\r", "\n"] (of course "\n\r" is also OK, since the character order of SEARCHLIST doesn't matter in strcspn).
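The two methods are complementary, as this sketch with the 'Bytes' charset tries to show. The first call repeats the example given for strspn; the other inputs are assumptions chosen to hit the boundary cases described above.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# strspn: length of the leading run made only of SEARCHLIST characters.
my $span = $bytes->strspn("+0.12345*12", "+-.0123456789");   # 8
my $zero = $bytes->strspn("hello", "xyz");                   # 0

# strcspn: length of the leading run free of SEARCHLIST characters.
my $cspan = $bytes->strcspn("+0.12345*12", "*");             # 8
my $whole = $bytes->strcspn("hello", "xyz");                 # 5, the full length

# SEARCHLIST given as an ARRAYREF of characters.
my $crlf = $bytes->strspn("\r\n\r\nabc", ["\r", "\n"]);      # 4

print "$span $zero $cspan $whole $crlf\n";
```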
$mbcs->substr(STRING or SCALAR REF, OFFSET)
$mbcs->substr(STRING or SCALAR REF, OFFSET, LENGTH)
$mbcs->substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
Works like CORE::substr, but using the character semantics of the multibyte charset encoding.
If REPLACEMENT is specified as the fourth argument, replaces part of the SCALAR and returns what was there before.
If a reference to a scalar variable is passed as the first argument, an lvalue reference is returned, which you can assign through.
${ $mbcs->substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not multibyte-aware, so successive assignments may lead to odd results.
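A minimal sketch of the two- and three-argument forms and of the lvalue reference, using the 'UTF8' charset. The sample string (hiragana A, I, U as raw UTF-8 bytes) is an assumption.

```perl
use strict;
use warnings;
use String::Multibyte;

my $utf8 = String::Multibyte->new('UTF8');

my $ka  = "\xE3\x81\x82";        # HIRAGANA LETTER A
my $ki  = "\xE3\x81\x84";        # HIRAGANA LETTER I
my $ku  = "\xE3\x81\x86";        # HIRAGANA LETTER U
my $str = $ka . $ki . $ku;       # 3 characters, 9 bytes

my $mid  = $utf8->substr($str, 1, 1);   # the second character
my $tail = $utf8->substr($str, 1);      # everything from the second character

# Assign through the lvalue reference: replace the second character.
${ $utf8->substr(\$str, 1, 1) } = '-';

print $utf8->length($str), "\n";        # still 3 characters
```

Offsets and lengths are given in characters; the byte offsets are worked out by the method from the charset definition.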
$mbcs->strsplit(SEPARATOR, STRING)
$mbcs->strsplit(SEPARATOR, STRING, LIMIT)
Works like CORE::split, but splits on the SEPARATOR string, not on a pattern.
If not in list context, returns only the number of fields found, and does not split into the @_ array.
If an empty string is specified as SEPARATOR, splits the specified string into characters.
$bytes->strsplit('', 'This is perl.', 7);
# ('T', 'h', 'i', 's', ' ', 'i', 's perl.')
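The following sketch exercises the separator, LIMIT, and scalar-context behaviour with the 'Bytes' charset. The '::' separator and the input strings are assumptions; the last call repeats the example above.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# The separator is a literal string, not a pattern.
my @fields  = $bytes->strsplit('::', 'a::b::c');      # ('a', 'b', 'c')
my @limited = $bytes->strsplit('::', 'a::b::c', 2);   # ('a', 'b::c')

# Not in list context: only the number of fields is returned.
my $count = $bytes->strsplit('::', 'a::b::c');        # 3

# Empty separator: split into characters, up to LIMIT fields.
my @chars = $bytes->strsplit('', 'This is perl.', 7);

print join('|', @fields), "\n";
print join('|', @limited), "\n";
print "$count\n";
print join('|', @chars), "\n";
```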
$mbcs->mkrange(CHARLIST, ALLOW_REVERSE)
Returns the character list (in scalar context, as a concatenated string) obtained by parsing the specified character range.
The result depends on the character order of the charset concerned. For the character order of each charset, see its definition file.
If the character order is undefined in the definition file, returns a string identical to the specified string.
A character range is specified with a hyphen ('-', or strictly speaking, $obj->{hyphen}).
The backslashed combinations '\-' and '\\' (strictly speaking, "$obj->{escape}$obj->{hyphen}" and "$obj->{escape}$obj->{escape}") are used instead of the characters '-' and '\', respectively. A hyphen at the beginning or the end of the range is also evaluated as the hyphen itself.
For example, $mbcs->mkrange('+\-0-9A-F') returns
('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
'A', 'B', 'C', 'D', 'E', 'F')
and scalar $mbcs->mkrange('A-P') returns 'ABCDEFGHIJKLMNOP'.
If a true value is specified as the second argument, reverse character ranges such as '9-0' and 'Z-A' are allowed.
$bytes = String::Multibyte->new('Bytes');
$bytes->mkrange('p-e-r-l', 1); # ponmlkjihgfefghijklmnopqrqponml
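The examples above can be run together as follows, using the 'Bytes' charset. Only the variable names are added; the inputs and results are the ones stated in the text.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# List context: one element per character; '\-' is a literal hyphen.
my @hex = $bytes->mkrange('+\-0-9A-F');

# Scalar context: the expanded characters as one string.
my $upper = $bytes->mkrange('A-P');

# Reverse ranges need a true second argument.
my $zigzag = $bytes->mkrange('p-e-r-l', 1);

print join('', @hex), "\n";
print "$upper\n";
print "$zigzag\n";
```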
$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
$mbcs->strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.
If the 'h' modifier is specified, returns a hash of histogram in list context, or a reference to a hash of histogram in scalar context.
SEARCHLIST and REPLACEMENTLIST
Character ranges (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty (specified as '', not undef, because the use of an uninitialized value causes a warning under the -w option), the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF. E.g. if a charset treats "\r\n" (CRLF) as a single character, "\r\n" is a one-element list of only "\r\n". A two-element list of "\r" and "\n" should be given as ["\r", "\n"]. Of course "\n\r" is also OK, but the character order is different; cf. strtr($str, ["\r", "\n"], ["\n", "\r"]), which swaps "\n" and "\r".
Each element of the ARRAYREF can include character ranges (the modifiers R and r affect their evaluation as usual). ["A-C", "h-z"] is evaluated like "A-Ch-z" if the charset does not include the grapheme "Ch". The former prevents "C" and "h" from being evaluated as "Ch" even if the charset includes the grapheme "Ch".
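A sketch of the ARRAYREF form with the 'Bytes' charset, where every byte is its own character. The input strings are assumptions; the swap of "\r" and "\n" is the one mentioned above.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# Explicit character lists: swap "\r" and "\n".
my $swapped = $bytes->strtr("a\r\nb", ["\r", "\n"], ["\n", "\r"]);

# Elements may carry ranges: A-C and h-z map to a-c and H-Z.
my $mixed = $bytes->strtr("Cheer", ["A-C", "h-z"], ["a-c", "H-Z"]);

print "$mixed\n";
```

Here 'C' maps to 'c', 'h' to 'H', and 'r' to 'R', while 'e' lies outside both ranges and is left alone.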
MODIFIER
c   Complement the SEARCHLIST.
d   Delete found but unreplaced characters.
s   Squash duplicate replaced characters.
h   Return a hash (or a hashref) of histogram.
R   No use of character ranges.
r   Allows to use reverse character ranges.
o   Caches the conversion table internally.
If the 'R' modifier is specified, '-' is not evaluated as a metacharacter but as the hyphen itself, as in tr'''. Compare:
$mbcs->strtr("90 - 32 = 58", "0-9", "A-J");
# output: "JA - DC = FI"

$mbcs->strtr("90 - 32 = 58", "0-9", "A-J", "R");
# output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J';
# '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, reverse character ranges are allowed. E.g.
$mbcs->strtr($str, "0-9", "9-0", "r")
is equivalent to
$mbcs->strtr($str, "0123456789", "9876543210")
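The modifiers discussed so far can be tried side by side with the 'Bytes' charset. The first two calls repeat the comparison above; the remaining inputs are assumptions.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

# Ranges on (default) versus off ('R').
my $ranged  = $bytes->strtr("90 - 32 = 58", "0-9", "A-J");        # "JA - DC = FI"
my $literal = $bytes->strtr("90 - 32 = 58", "0-9", "A-J", "R");   # "JA - 32 = 58"

# 'r': reverse ranges, so each digit d becomes 9 - d.
my $flipped = $bytes->strtr("2024", "0-9", "9-0", "r");           # "7975"

# 'd': found characters without a replacement are deleted.
my $deleted = $bytes->strtr("hello", "el", "E", "d");             # "hEo"

# Scalar reference: the string is changed in place and a count is returned.
my $word  = "hello";
my $count = $bytes->strtr(\$word, "a-z", "A-Z");                  # 5; $word is "HELLO"

print "$ranged\n$literal\n$flipped\n$deleted\n$count $word\n";
```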
Caching the conversion table
If the 'o' modifier is specified, the conversion table is cached internally. E.g.
foreach (@source_strings) { print $mbcs->strtr($_, $from_list, $to_list, 'o'); }
will be almost as efficient as this:
$trans = $mbcs->trclosure($from_list, $to_list);
foreach (@source_strings) { print &$trans($_); }
You can use whichever you like.
Without 'o',
foreach (@source_strings) { print $mbcs->strtr($_, $from_list, $to_list); }
will be very slow, since the conversion table is rebuilt every time the function is called.
$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST)
$mbcs->trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
Returns a closure that transliterates the specified string.
my $trans = $mbcs->trclosure($from_list, $to_list);
print &$trans ($string); # ok to perl 5.003
print $trans->($string); # perl 5.004 or better
The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
SEARCHLIST and REPLACEMENTLIST can be an ARRAYREF, the same as for strtr().
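As a sketch of the two equivalent styles, the following builds a ROT13 closure once and compares it with strtr() using the 'o' modifier. The ROT13 lists and the input words are assumptions chosen for illustration, with the 'Bytes' charset.

```perl
use strict;
use warnings;
use String::Multibyte;

my $bytes = String::Multibyte->new('Bytes');

my $from = 'a-zA-Z';
my $to   = 'n-za-mN-ZA-M';

# Build the conversion table once and reuse the closure.
my $rot13 = $bytes->trclosure($from, $to);
my @by_closure = map { $rot13->($_) } ('Hello', 'Perl');

# The same result through strtr() with the cached table ('o').
my @by_strtr = map { $bytes->strtr($_, $from, $to, 'o') } ('Hello', 'Perl');

print "@by_closure\n";   # Uryyb Crey
print "@by_strtr\n";
```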
This module assumes that $[ is always equal to 0, never 1.
In a grapheme-aware manipulation, notice that the beginning and the end of a string always lie on a grapheme boundary.
E.g. imagine a grapheme set where a grapheme comprises either a leading Latin capital letter followed by zero or more Latin small letters, or a single byte. Such a set can be defined as below.
$gra = String::Multibyte->new({ regexp => '[A-Z][a-z]*|[\x00-\xFF]', });
Think about $gra->index("Perl", "Pe"). As both "Perl" and "Pe" are single graphemes, they are not equal to each other. So the result of this must be -1 (meaning no match).
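The grapheme set above can be explored with a short sketch. The input strings are assumptions; the value of the final index call is printed rather than asserted, since the text above is the authority on what it should be.

```perl
use strict;
use warnings;
use String::Multibyte;

my $gra = String::Multibyte->new({
    regexp => '[A-Z][a-z]*|[\x00-\xFF]',
});

my $one      = $gra->length('Perl');         # 1: "Perl" is a single grapheme
my $three    = $gra->length('PerlIsFun');    # 3: "Perl", "Is", "Fun"
my $reversed = $gra->strrev('PerlIsFun');    # "FunIsPerl"

my $byte_pos = CORE::index('Perl', 'Pe');    # 0: byte-wise prefix match
my $gra_pos  = $gra->index('Perl', 'Pe');    # the text above expects -1

print "$one $three $reversed\n";
print "CORE::index: $byte_pos, grapheme index: $gra_pos\n";
```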
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001-2015, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
perl(1).