Unicode::Normalize - Unicode Normalization Forms
(1) using function names exported by default:
    use Unicode::Normalize;
    $NFD_string  = NFD($string);  # Normalization Form D
    $NFC_string  = NFC($string);  # Normalization Form C
    $NFKD_string = NFKD($string); # Normalization Form KD
    $NFKC_string = NFKC($string); # Normalization Form KC
(2) using function names exported on request:
    use Unicode::Normalize 'normalize';
    $NFD_string  = normalize('D',  $string);  # Normalization Form D
    $NFC_string  = normalize('C',  $string);  # Normalization Form C
    $NFKD_string = normalize('KD', $string);  # Normalization Form KD
    $NFKC_string = normalize('KC', $string);  # Normalization Form KC
Parameters:
$string is used as a string under character semantics (see the perlunicode manpage).
$code_point should be an unsigned integer representing a Unicode code point.
Note: XSUB and pure Perl interpret a $code_point given as a decimal number incompatibly. XSUB converts $code_point to an unsigned integer, but pure Perl does not. Do not use a floating-point value or a negative sign in $code_point.
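Because $string is treated under character semantics, raw bytes should be decoded before they are normalized. The following is a minimal sketch of that step, assuming the input arrives as UTF-8 bytes; the variable names and the sample data are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);
use Unicode::Normalize;

# UTF-8 bytes for "e" followed by U+0301 COMBINING ACUTE ACCENT.
my $bytes = "e\xCC\x81";

# Decode first, so the string holds characters rather than bytes.
my $chars = decode('UTF-8', $bytes);

# Normalize the character string.
my $nfc = NFC($chars);

printf "bytes: %d, characters: %d, NFC length: %d\n",
    length($bytes), length($chars), length($nfc);
```

Normalizing the undecoded bytes would treat each byte as a separate character, so the accent would never combine with its base letter.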
$NFD_string = NFD($string)
$NFC_string = NFC($string)
$NFKD_string = NFKD($string)
$NFKC_string = NFKC($string)
$FCD_string = FCD($string)
Note: FCD is not always unique; several forms may be equivalent to each other. FCD() returns one of these equivalent forms.
$FCC_string = FCC($string)
Note: FCC is unique, as are the four normalization forms (NF*).
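To compare the four NF* forms side by side, a small sketch such as the following may help. The sample string and the code_points() helper are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize;

# Hypothetical helper: render a string as a list of code points.
sub code_points {
    return join ' ', map { sprintf 'U+%04X', ord } split //, $_[0];
}

# U+00E9 (e with acute) followed by U+FB01 (the "fi" ligature).
my $string = "\x{00E9}\x{FB01}";

my %forms = (
    NFD  => NFD($string),   # canonical decomposition
    NFC  => NFC($string),   # canonical decomposition, then composition
    NFKD => NFKD($string),  # compatibility decomposition
    NFKC => NFKC($string),  # compatibility decomposition, then composition
);

printf "%-4s %s\n", $_, code_points($forms{$_}) for sort keys %forms;
```

The canonical forms keep the ligature, while the compatibility forms replace it with the plain letters "f" and "i".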
$normalized_string = normalize($form_name, $string)
Returns the normalization form named by $form_name.
As $form_name, one of the following names must be given.
    'C'  or 'NFC'  for Normalization Form C  (UAX #15)
    'D'  or 'NFD'  for Normalization Form D  (UAX #15)
    'KC' or 'NFKC' for Normalization Form KC (UAX #15)
    'KD' or 'NFKD' for Normalization Form KD (UAX #15)
    'FCD'          for "Fast C or D" Form  (UTN #5)
    'FCC'          for "Fast C Contiguous" (UTN #5)
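normalize() is convenient when the form name is only known at run time. A brief sketch, assuming the form name comes from a variable (here a hard-coded list) and using U+212B ANGSTROM SIGN as sample input:

```perl
use strict;
use warnings;
# An explicit import list replaces the defaults, so NFC and NFD are named too.
use Unicode::Normalize qw(normalize NFC NFD);

my $string = "\x{212B}";    # ANGSTROM SIGN

# Short and long names select the same form.
for my $form_name (qw(C NFC D NFD)) {
    my $normalized = normalize($form_name, $string);
    printf "%-3s -> %s\n", $form_name,
        join ' ', map { sprintf 'U+%04X', ord } split //, $normalized;
}
```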
$decomposed_string = decompose($string [, $useCompatMapping])
If the second parameter (a boolean) is omitted or false, the decomposition is canonical; if it is true, the decomposition is a compatibility decomposition.
The string returned is not always in NFD/NFKD. Reordering may be required.
    $NFD_string  = reorder(decompose($string));    # eq. to NFD()
    $NFKD_string = reorder(decompose($string, 1)); # eq. to NFKD()
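The need for reordering shows up when combining marks are stored out of canonical order. A small sketch, using an illustrative string with a dot above (combining class 230) placed before a dot below (combining class 220):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD decompose reorder);

# "q" + COMBINING DOT ABOVE + COMBINING DOT BELOW: nothing here decomposes,
# but the two marks are not in canonical order.
my $string = "q\x{0307}\x{0323}";

my $decomposed = decompose($string);      # marks are left in their given order
my $reordered  = reorder($decomposed);    # marks sorted by combining class

print "decompose() alone is NFD: ", ($decomposed eq NFD($string) ? "yes" : "no"), "\n";
print "after reorder() is NFD:   ", ($reordered  eq NFD($string) ? "yes" : "no"), "\n";
```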
$reordered_string = reorder($string)
For example, when you have a list of NFD/NFKD strings, you can get the concatenated NFD/NFKD string from them by saying
    $concat_NFD  = reorder(join '', @NFD_strings);
    $concat_NFKD = reorder(join '', @NFKD_strings);
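A plain join can leave combining marks out of order across a boundary between two strings. The following sketch illustrates this with two made-up NFD fragments:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD reorder);

# Each fragment is in NFD on its own, but the second begins with a
# combining mark (dot below, class 220) that must sort before the
# acute accent (class 230) at the end of the first.
my @NFD_strings = ("a\x{0301}", "\x{0323}b");

my $joined     = join '', @NFD_strings;
my $concat_NFD = reorder($joined);

print "joined string is NFD:    ", ($joined     eq NFD($joined) ? "yes" : "no"), "\n";
print "reordered string is NFD: ", ($concat_NFD eq NFD($joined) ? "yes" : "no"), "\n";
```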
$composed_string = compose($string)
For example, when you have an NFD/NFKD string, you can get its NFC/NFKC string by saying
    $NFC_string  = compose($NFD_string);
    $NFKC_string = compose($NFKD_string);
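As a quick sanity check of that relationship, the sketch below composes already decomposed strings and compares the results with NFC() and NFKC(). The sample string is an illustrative choice.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD NFKC NFKD compose);

my $string = "\x{00E9}\x{FB01}";    # e with acute, then the "fi" ligature

my $NFC_string  = compose(NFD($string));     # compose a canonical decomposition
my $NFKC_string = compose(NFKD($string));    # compose a compatibility decomposition

print "compose(NFD)  matches NFC:  ", ($NFC_string  eq NFC($string)  ? "yes" : "no"), "\n";
print "compose(NFKD) matches NFKC: ", ($NFKC_string eq NFKC($string) ? "yes" : "no"), "\n";
```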
($processed, $unprocessed) = splitOnLastStarter($normalized)
Returns two strings: the first one, $processed, is the part before the last starter, and the second one, $unprocessed, is the remaining part after the first. A starter is a character having a combining class of zero (see UAX #15).
Note that $processed may be empty (when $normalized contains no starter or starts with the last starter), and then $unprocessed should be equal to the entire $normalized.
When you have a $normalized string and an $unnormalized string following it, a simple concatenation is wrong:
    $concat = $normalized . normalize($form, $unnormalized); # wrong!
Instead, do this:
    ($processed, $unprocessed) = splitOnLastStarter($normalized);
    $concat = $processed . normalize($form, $unprocessed.$unnormalized);
splitOnLastStarter() should be called with a pre-normalized parameter $normalized that is already in the same form as the $form you want.
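A concrete case makes the difference visible. In this sketch the sample strings are illustrative: the unnormalized part begins with a combining accent that has to compose with the last character of the normalized part.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC normalize splitOnLastStarter);

my $form         = 'NFC';
my $normalized   = "xA";           # already in NFC
my $unnormalized = "\x{0301}y";    # starts with COMBINING ACUTE ACCENT

# Naive concatenation: the accent never meets the "A".
my $naive = $normalized . normalize($form, $unnormalized);

# Re-normalize from the last starter onward.
my ($processed, $unprocessed) = splitOnLastStarter($normalized);
my $concat = $processed . normalize($form, $unprocessed . $unnormalized);

my $expected = NFC($normalized . $unnormalized);
print "naive is NFC:  ", ($naive  eq $expected ? "yes" : "no"), "\n";
print "concat is NFC: ", ($concat eq $expected ? "yes" : "no"), "\n";
```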
If you have an array @string whose elements should be concatenated and then normalized, you can do this:
    my $result = "";
    my $unproc = "";
    foreach my $str (@string) {
        $unproc .= $str;
        my $n = normalize($form, $unproc);
        my($p, $u) = splitOnLastStarter($n);
        $result .= $p;
        $unproc  = $u;
    }
    $result .= $unproc;
    # instead of normalize($form, join('', @string))
$processed = normalize_partial($form, $unprocessed)
A wrapper for the combination of normalize() and splitOnLastStarter().
Note that $unprocessed will be modified as a side-effect.
If you have an array @string whose elements should be concatenated and then normalized, you can do this:
    my $result = "";
    my $unproc = "";
    foreach my $str (@string) {
        $unproc .= $str;
        $result .= normalize_partial($form, $unproc);
    }
    $result .= $unproc;
    # instead of normalize($form, join('', @string))
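The same loop can be packaged as a reusable subroutine. This is only a sketch: normalize_chunks() is a hypothetical helper name, and the chunk boundaries are chosen so that a combining accent is split from its base letter.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize normalize_partial);

# Hypothetical helper: normalize a sequence of chunks incrementally.
sub normalize_chunks {
    my ($form, @chunks) = @_;
    my $result = '';
    my $unproc = '';
    for my $chunk (@chunks) {
        $unproc .= $chunk;
        # Appends the finished part; $unproc keeps the unfinished tail.
        $result .= normalize_partial($form, $unproc);
    }
    return $result . $unproc;
}

my @chunks   = ("caf", "e", "\x{0301} au", " lait");
my $streamed = normalize_chunks('NFC', @chunks);
my $whole    = normalize('NFC', join '', @chunks);

print "incremental result matches one-shot result: ",
    ($streamed eq $whole ? "yes" : "no"), "\n";
```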
$processed = NFD_partial($unprocessed)
Equivalent to normalize_partial('NFD', $unprocessed).
Note that $unprocessed will be modified as a side-effect.
$processed = NFC_partial($unprocessed)
Equivalent to normalize_partial('NFC', $unprocessed).
Note that $unprocessed will be modified as a side-effect.
$processed = NFKD_partial($unprocessed)
Equivalent to normalize_partial('NFKD', $unprocessed).
Note that $unprocessed will be modified as a side-effect.
$processed = NFKC_partial($unprocessed)
Equivalent to normalize_partial('NFKC', $unprocessed).
Note that $unprocessed will be modified as a side-effect.
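The side effect on the argument is easy to overlook. The sketch below, with an illustrative buffer, shows what is returned and what is left behind in the variable.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC_partial);

# "abc" followed by COMBINING ACUTE ACCENT; in NFC the last two
# characters compose into U+0107 (c with acute).
my $buffer = "abc\x{0301}";

my $done = NFC_partial($buffer);

# $done holds the part before the last starter; $buffer now holds the rest.
printf "returned %d character(s), %d left in the buffer\n",
    length($done), length($buffer);
```

Pass a copy if the original contents of the variable are still needed afterwards.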
(see Annex 8, UAX #15; and DerivedNormalizationProps.txt)
The following functions check whether the string is in that normalization form.
The result returned will be one of the following:
    YES     The string is in that normalization form.
    NO      The string is not in that normalization form.
    MAYBE   Dubious. Maybe yes, maybe no.
$result = checkNFD($string)
Returns true (1) if YES; false (empty string) if NO.
$result = checkNFC($string)
Returns true (1) if YES; false (empty string) if NO; undef if MAYBE.
$result = checkNFKD($string)
Returns true (1) if YES; false (empty string) if NO.
$result = checkNFKC($string)
Returns true (1) if YES; false (empty string) if NO; undef if MAYBE.
$result = checkFCD($string)
Returns true (1) if YES; false (empty string) if NO.
$result = checkFCC($string)
Returns true (1) if YES; false (empty string) if NO; undef if MAYBE.
Note: If a string is not in FCD, it cannot be in FCC. So checkFCC($not_FCD_string) should return NO.
$result = check($form_name, $string)
Returns true (1) if YES; false (empty string) if NO; undef if MAYBE.
As $form_name, one of the following names must be given.
    'C'  or 'NFC'  for Normalization Form C  (UAX #15)
    'D'  or 'NFD'  for Normalization Form D  (UAX #15)
    'KC' or 'NFKC' for Normalization Form KC (UAX #15)
    'KD' or 'NFKD' for Normalization Form KD (UAX #15)
    'FCD'          for "Fast C or D" Form  (UTN #5)
    'FCC'          for "Fast C Contiguous" (UTN #5)
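Since NO is a defined false value and MAYBE is undef, the two must be told apart with defined(). A minimal sketch; describe() is a hypothetical helper and the sample strings are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(check checkNFC checkNFD);

# Hypothetical helper: map a quick-check result to its three-state name.
sub describe {
    my ($result) = @_;
    return 'MAYBE' unless defined $result;
    return $result ? 'YES' : 'NO';
}

my $decomposed  = "A\x{0301}";    # A + COMBINING ACUTE ACCENT
my $precomposed = "\x{00C1}";     # LATIN CAPITAL LETTER A WITH ACUTE

print "checkNFD(decomposed):  ", describe(checkNFD($decomposed)),  "\n";
print "checkNFD(precomposed): ", describe(checkNFD($precomposed)), "\n";
print "checkNFC(decomposed):  ", describe(checkNFC($decomposed)),  "\n";
print "check('NFC', 'abc'):   ", describe(check('NFC', 'abc')),    "\n";
```

A plain boolean test such as `if (checkNFC($string))` would treat MAYBE the same as NO.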
Note
In the cases of NFD, NFKD, and FCD, the answer must be either YES or NO. The answer MAYBE may be returned in the cases of NFC, NFKC, and FCC.
A MAYBE string should contain at least one combining character or the like. For example, COMBINING ACUTE ACCENT has the MAYBE_NFC/MAYBE_NFKC property.
Both checkNFC("A\N{COMBINING ACUTE ACCENT}") and checkNFC("B\N{COMBINING ACUTE ACCENT}") will return MAYBE. "A\N{COMBINING ACUTE ACCENT}" is not in NFC (its NFC is "\N{LATIN CAPITAL LETTER A WITH ACUTE}"), while "B\N{COMBINING ACUTE ACCENT}" is in NFC.
If you want to check exactly, compare the string with its NFC/NFKC/FCC.
    if ($string eq NFC($string)) {
        # $string is exactly normalized in NFC;
    } else {
        # $string is not normalized in NFC;
    }
    if ($string eq NFKC($string)) {
        # $string is exactly normalized in NFKC;
    } else {
        # $string is not normalized in NFKC;
    }
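The quick check and the exact comparison can be combined, so that the full normalization runs only when the quick check answers MAYBE. This is a sketch of one possible arrangement; is_nfc() is a hypothetical helper, not a function of the module.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC checkNFC);

# Hypothetical helper: exact NFC test that normalizes only when needed.
sub is_nfc {
    my ($string) = @_;
    my $quick = checkNFC($string);
    # YES or NO is definitive.
    return $quick ? 1 : 0 if defined $quick;
    # MAYBE: fall back to the exact comparison.
    return $string eq NFC($string) ? 1 : 0;
}

for my $sample ("A\x{0301}", "B\x{0301}", "\x{00C1}") {
    printf "%-13s %s\n",
        join(' ', map { sprintf 'U+%04X', ord } split //, $sample),
        is_nfc($sample) ? "in NFC" : "not in NFC";
}
```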
These functions are interfaces to the character data used internally. If you only want to get Unicode normalization forms, you don't need to call them yourself.
$canonical_decomposition = getCanon($code_point)
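If the character is canonically decomposable, it returns the canonical decomposition as a string. Otherwise it returns undef.
Note: According to the Unicode standard, the canonical decomposition of a character that is not canonically decomposable is the same as the character itself.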
$compatibility_decomposition = getCompat($code_point)
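If the character is compatibility decomposable, it returns the compatibility decomposition as a string. Otherwise it returns undef.
Note: According to the Unicode standard, the compatibility decomposition of a character that is not compatibility decomposable is the same as the character itself.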
$code_point_composite = getComposite($code_point_here, $code_point_next)
If they are not composable, it returns undef.
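A short sketch of these lookups follows. canon_or_self() is a hypothetical helper that applies the Unicode convention from the notes above, returning the character itself when there is no decomposition; the code points are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCanon getCompat getComposite);

# Hypothetical helper: canonical decomposition, or the character itself.
sub canon_or_self {
    my ($code_point) = @_;
    my $decomposition = getCanon($code_point);
    return defined $decomposition ? $decomposition : chr($code_point);
}

my $canon     = getCanon(0x00C1);                # LATIN CAPITAL LETTER A WITH ACUTE
my $none      = getCanon(0x0041);                # plain "A" has no decomposition
my $compat    = getCompat(0xFB01);               # LATIN SMALL LIGATURE FI
my $composite = getComposite(0x0041, 0x0301);    # A + COMBINING ACUTE ACCENT
my $no_pair   = getComposite(0x0041, 0x0042);    # A + B do not compose

printf "getCanon(U+00C1) has %d characters\n", length $canon;
printf "getCanon(U+0041) is %s\n", defined $none ? "defined" : "undef";
printf "getComposite(U+0041, U+0301) is U+%04X\n", $composite;
```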
$combining_class = getCombinClass($code_point)
$may_be_composed_with_prev_char = isComp2nd($code_point)
$is_exclusion = isExclusion($code_point)
$is_singleton = isSingleton($code_point)
$is_non_starter_decomposition = isNonStDecomp($code_point)
$is_Full_Composition_Exclusion = isComp_Ex($code_point)
$NFD_is_NO = isNFD_NO($code_point)
$NFC_is_NO = isNFC_NO($code_point)
$NFC_is_MAYBE = isNFC_MAYBE($code_point)
$NFKD_is_NO = isNFKD_NO($code_point)
$NFKC_is_NO = isNFKC_NO($code_point)
$NFKC_is_MAYBE = isNFKC_MAYBE($code_point)
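The descriptions of these property lookups are not reproduced here, so the following sketch relies on what the names and return variables suggest: getCombinClass() gives a canonical combining class as an integer, and the is* functions give booleans. The code points are illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass isNFD_NO isNFC_NO isNFC_MAYBE);

my %names = (
    0x0041 => 'LATIN CAPITAL LETTER A',
    0x00C1 => 'LATIN CAPITAL LETTER A WITH ACUTE',
    0x0301 => 'COMBINING ACUTE ACCENT',
    0x212B => 'ANGSTROM SIGN',
);

for my $code_point (sort { $a <=> $b } keys %names) {
    printf "U+%04X ccc=%-3d NFD_NO=%d NFC_NO=%d NFC_MAYBE=%d  %s\n",
        $code_point,
        getCombinClass($code_point),
        isNFD_NO($code_point)    ? 1 : 0,
        isNFC_NO($code_point)    ? 1 : 0,
        isNFC_MAYBE($code_point) ? 1 : 0,
        $names{$code_point};
}
```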
NFC, NFD, NFKC, NFKD: by default.
normalize and some other functions: on request.
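In practice this means that naming any function in the import list replaces the default set, which is standard Exporter behavior. A brief sketch, with an illustrative sample string:

```perl
use strict;
use warnings;
# NFC is listed explicitly: an explicit list replaces the default exports.
use Unicode::Normalize qw(NFC normalize checkNFC);

my $string = "e\x{0301}";

my $by_default_name = NFC($string);
my $on_request      = normalize('C', $string);

# Functions that were not imported remain reachable by their full name.
my $fully_qualified = Unicode::Normalize::NFD($by_default_name);

print "normalize('C') matches NFC(): ",
    ($on_request eq $by_default_name ? "yes" : "no"), "\n";
```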
See $Config{privlib}/unicore/README.perl for details.
    perl's version     implemented Unicode version
       5.6.1              3.0.1
       5.7.2              3.1.0
       5.7.3              3.1.1 (normalization is same as 3.1.0)
       5.8.0              3.2.0
     5.8.1-5.8.3          4.0.0
     5.8.4-5.8.6          4.0.1 (normalization is same as 4.0.0)
     5.8.7-5.8.8          4.1.0
       5.10.0             5.0.0
     5.8.9, 5.10.1        5.1.0
       5.12.x             5.2.0
       5.14.x             6.0.0
       5.16.x             6.1.0
       5.18.x             6.2.0
       5.20.x             6.3.0
       5.22.x             7.0.0
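To find out which row applies to the perl at hand, the Unicode version can be queried at run time. This sketch uses the core module Unicode::UCD, which is a separate module and an assumption beyond what this document describes.

```perl
use strict;
use warnings;
use Unicode::UCD ();

# Version of the Unicode database that this perl was built with.
my $unicode_version = Unicode::UCD::UnicodeVersion();

printf "perl %vd implements Unicode %s\n", $^V, $unicode_version;
```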
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Currently maintained by <perl5-porters@perl.org>
Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.