[% setvar title Unicode Combinatorix %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


Unicode Combinatorix


  Maintainer: Simon Cozens <simon@brecon.co.uk>
  Date: 25 Sep 2000
  Mailing List: perl6-internals@perl.org
  Number: 312
  Version: 1
  Status: Developing


How and when Unicode is used in Perl 6. Its sister RFC "When UTF8 Leaks Out" deals with how output is processed with respect to Unicode; this RFC deals with input and processing.


Here is a proposed overview of Perl 6's Unicode handling, covering everything except output.

Data comes into Perl in one of two ways: through a filehandle or socket, where line disciplines apply, or by "any other method".

Data which comes in through a line discipline must be in UTF8, unless no unicode is in force.
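The UTF8 requirement can be pictured with a minimal Python sketch. The function name read_through_discipline and the unicode_enabled flag are hypothetical stand-ins for a line discipline and for the "no unicode" setting. This RFC does not say what happens to malformed input, so raising an error here is an assumption.

```python
def read_through_discipline(raw: bytes, unicode_enabled: bool = True):
    """Hypothetical stand-in for data arriving through a line discipline."""
    if not unicode_enabled:
        # Mirrors "no unicode": the bytes pass through untouched.
        return raw
    # Strict decoding raises UnicodeDecodeError on malformed UTF-8
    # (an assumption; the RFC does not specify the failure behaviour).
    return raw.decode("utf-8", errors="strict")
```

With unicode_enabled left on, well-formed UTF8 decodes to text and anything else is rejected at the boundary, so later stages never see malformed data.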

Examples of data which enter Perl from "any other method" are environment variables, stuff coming out of backticks, and globs. It's assumed that these will come into Perl as ISO8859-1, and will be converted into UTF8 on entry. If the user wants to tell us that this data won't be ISO8859-1, then they may say use unicode::system 'discipline' where 'discipline' is any level 2 processing module. The data will be filtered through that module, and we will have UTF8 data at the end of this.
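As a rough illustration of the "any other method" path, the Python sketch below treats incoming bytes as ISO8859-1 by default and lets the caller supply a replacement decoder, which plays the role of the module named in use unicode::system. The names import_system_data and discipline are hypothetical, and modelling a level 2 processing module as a plain function from bytes to text is an assumption.

```python
from typing import Callable, Optional


def import_system_data(
    raw: bytes,
    discipline: Optional[Callable[[bytes], str]] = None,
) -> bytes:
    """Convert environment-style input to UTF-8 (hypothetical helper)."""
    if discipline is None:
        # Default assumption from the text: the data is ISO8859-1.
        text = raw.decode("iso-8859-1")
    else:
        # The caller has declared a different source encoding.
        text = discipline(raw)
    # Whatever the source, UTF-8 comes out at the end.
    return text.encode("utf-8")
```

For example, import_system_data(b"caf\xe9") reads the final byte as ISO8859-1 and re-encodes it as two UTF8 bytes, while passing discipline=lambda b: b.decode("koi8-r") would reinterpret the same input as KOI8-R instead.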

OK, at this stage, everything inside Perl is in Unicode. The next thing we need to do is to normalise it, and the RFC on normalisation covers that: basically, we either do this on demand or immediately. The setting of unicode::exact comes into play here.

Comparisons can be carried out in three ways. If unicode::representation is in force, then the bytes must be exactly equal. If unicode::exact is set, a normalised copy of each operand is made, and the copies are compared. Otherwise, the operands themselves are normalised and compared. As the FAQ says, "Canonical equivalence matters".
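The difference between byte equality and canonical equivalence can be sketched in Python using the standard unicodedata module. The function unicode_equal and its mode strings are hypothetical, and NFC is an assumed choice of normalisation form, since this RFC does not name one. Python strings are immutable, so the sketch cannot show the copy-versus-original distinction between the exact and default cases; both simply compare normalised forms.

```python
import unicodedata


def unicode_equal(a: str, b: str, mode: str = "default") -> bool:
    """Compare two strings under one of the three regimes (hypothetical)."""
    if mode == "representation":
        # Like unicode::representation: the encoded bytes must match exactly.
        return a.encode("utf-8") == b.encode("utf-8")
    # "exact" and "default" both compare normalised forms. NFC is an
    # assumption; the RFC leaves the form to the normalisation RFC.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent
```

Here unicode_equal(composed, decomposed, "representation") is false because the byte sequences differ, while the other two modes report the strings as equal, which is the sense in which canonical equivalence matters.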

Collation should take place according to the Unicode collation tables; if use locale is set, then the collation is localised as well. The Unicode locales RFC suggests other areas affected by locales, such as word and line breaking and Unicode character classes.

no unicode just throws all of this out: none of the above happens.


Just leave that to me...


"What Level of Support Should I Look For?", www.unicode.org

RFC 311: Line Disciplines

RFC 295: Normalisation and unicode::exact

RFC 300: use unicode::representation and no unicode

RFC ??: When UTF8 Leaks Out

RFC ??: Abstract Internals String Interaction

RFC ??: Unicode Locales