[% setvar title use unicode::representation and no unicode %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/


use unicode::representation and no unicode


  Maintainer: Simon Cozens <simon@brecon.co.uk>
  Date: 25 Sep 2000
  Mailing List: perl6-internals@perl.org
  Number: 300
  Version: 1
  Status: Developing


Perl 5.6's use bytes is a useful pragma; this RFC stipulates its intended behaviour in Perl 6 right now.


When Perl 5.6 introduced Unicode support, there were suddenly two ways of handling data: you could handle it as a series of Unicode characters, or you could manipulate it byte-for-byte. The use bytes pragma was used to force byte-for-byte manipulation.
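As a small illustration of the two views in Perl 5 terms, the sketch below measures the same string once with character semantics and once under use bytes. The variable names are made up for the example, and the byte count assumes the string is held as UTF8 internally, which is the case for any character above U+00FF.

```perl
use strict;
use warnings;

# One character above U+00FF, so Perl 5 stores the string as UTF8 internally.
my $s = "\x{263A}";    # WHITE SMILING FACE

# Default view: length counts characters.
my $chars = length($s);

# Byte view: use bytes is lexically scoped, so it only affects this block.
my $bytes;
{
    use bytes;
    $bytes = length($s);
}

printf "%d character(s), %d byte(s)\n", $chars, $bytes;
```

This prints "1 character(s), 3 byte(s)", since U+263A occupies three bytes in UTF8.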

The problem with this was mainly one of understanding: you had to know when Perl was handling your data as UTF8 before use bytes made any sense, and then it looked as though you were letting people fiddle with the internal representation of data, something that should be hidden from them.

This was only really a problem because some data was encoded and some wasn't. If we're using a consistent Unicode representation internally, and that's known and documented, the root of the problem goes away; since everything's going to be converted to UTF8 when it enters Perl, it makes sense to want to get at that UTF8. It's considerably less messy now.

But why? Why would someone want to get at the UTF8? One reason, which I believe to be sufficient to justify the pragma, and which is noted in "Normalisation and unicode::exact", is to compare representations without decomposition. unicode::exact gives you non-destructive comparisons, but it doesn't give you byte-for-byte comparisons. use bytes would do this. Other uses which have been noted on p5p recently include XML and sending data over the network.
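To make the distinction concrete, here is a rough Perl 5 analogy rather than the proposed Perl 6 behaviour. It assumes the core Unicode::Normalize module (shipped with later Perl 5 releases) as a stand-in for a normalising comparison, and uses utf8::encode on copies to stand in for looking at the UTF8 representation; neither is the pragma this RFC proposes, and the variable names are hypothetical.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Two spellings of the same text.
my $precomposed = "\x{E9}";      # e-acute as a single code point
my $decomposed  = "e\x{301}";    # "e" followed by COMBINING ACUTE ACCENT

# Comparison after canonical decomposition: the strings are the same text.
my $same_text = NFD($precomposed) eq NFD($decomposed) ? 1 : 0;

# Comparison of the UTF8 representations: encode copies, then compare bytes.
my $same_bytes = do {
    my ($x, $y) = ($precomposed, $decomposed);
    utf8::encode($x);            # $x now holds the bytes C3 A9
    utf8::encode($y);            # $y now holds the bytes 65 CC 81
    $x eq $y ? 1 : 0;
};

printf "same text: %d, same bytes: %d\n", $same_text, $same_bytes;
```

This prints "same text: 1, same bytes: 0": the two strings agree once decomposed but differ in their representation, which is the kind of difference a byte-for-byte comparison is meant to expose.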

However, I want to keep all the Unicode-related pragmata consistently named, so I suggest that use bytes be renamed to use unicode::representation; use bytes appeared to be a rather confusing name anyway.

As part of the "What does use bytes mean?" discussion/debacle/holy war, Nick Ing-Simmons expressed the desire for a new pragma which turned off Unicode altogether. Let's put that in, and call it no unicode.


The exact semantics of no unicode will be thrashed out in "Unicode Combinatorix"; use unicode::representation can be implemented similarly to use bytes.



the as-yet-unsubmitted RFCs on

RFC 295: Normalisation and unicode::exact

RFC ??: When UTF8 Leaks Out

RFC 312: Unicode Combinatorix

RFC 311: Line Disciplines