[% setvar title Modify open() to support FileObjects and Extensibility %]

This file is part of the Perl 6 Archive

Note: these documents may be out of date. Do not use as reference!

To see what is currently happening visit http://www.perl6.org/

TITLE

Modify open() to support FileObjects and Extensibility

VERSION

  Maintainer: Nathan Wiger <nate@wiger.org>
  Date: 4 Aug 2000
  Last Modified: 15 Sep 2000
  Mailing List: perl6-language-io@perl.org
  Number: 14
  Version: 4
  Status: Frozen

ABSTRACT

Currently, open(), opendir(), sysopen(), and other file open functions are given handle arguments, whose values are twiddled if a filehandle can be created:

   open(HANDLE, "<$filename");
   open PIPE, "| $program";
   open my $fh, "<$filename";
   opendir DIR, $dirname;

There are several problems with this approach:

   1. The calling style is uncharacteristic of other Perl funcs
   2. There is no way to support a list of return values
   3. There is no way to overload or extend them

In order to make these functions more consistent with other constructor like functions, they should be changed to instead return first-class fileobjects:

   $fo = open "<$filename" or die;
   $po = open "|$program" or die;
   $do = open dir $dirname or die;

This would make these functions more internally consistent within Perl, as well as allowing for the power of true fileobjects and an extensibile framework that can handle a multitude of different file types.

NOTES ON FREEZE

This RFC received an excellent response; almost surprisingly so, in fact. Most everyone agreed that the new open syntax was much more Perlish and flexible. And many were enthusiastic about basic fileobjects being so powerful. Some of the concepts - like URI support - were suggested by others and, once integrated, rounded out the proposal very nicely.

This proposal can stand on its own, but for best results should probably be coupled with:

   RFC 101: Handlers and Pseudo-classes
   RFC 239: Standardization of IO Functions to use Indirect Objects

Or at least read together, to get a little better overview of how these could be tied together.

DESCRIPTION

Overview

First, this RFC assumes that fileobjects will be $ single-whatzitz (thanks Tom) types, which seems to have reached an overall informal consensus.

As many have observed, the current filehandle mechanism is insufficient and largely "a historial accident". The goal of the redesign of file handles into full-fledged fileobjects is to make them as flexible and powerful as other objects within Perl, which still retaining a means to interact with them simply. Since we are redesigning filehandles to be true fileobjects, we should revise their constructor functions as well, returning fileobjects and providing extensibility.

As shown above, in the simplest case this would change the open() function to:

   $fo = open $filename or die;

If successful, open() and its relatives will return fileobjects. On failure, they will return undef. This still allows the user to test the return value of open() (as shown above) by checking for a "true" (fileobject) or "false" (undef) condition.

New Syntax of open()

The syntax of the open() function would be changed as follows:

   $fileobject, [ @params ] = open [ $handler ] $file, [ @args ];

Let's examine this syntax more closely:

   $fileobject  -  Replacement object for current filehandles.

   @params      -  Optional parameters that may be returned in a list
                   context. These may be things such as the owner for
                   a true file, or the content-type for a web document.

   $handler     -  The class from which to load the appropriate file
                   methods, the default being "file". This is just a
                   special type of pseudoclass. The handler type is
                   bound to a class given by the user, or taken from a
                   set of core methods. Think Apache handler.

                   This is simply called via the indirect object syntax
                   if it exists. For more details on handlers, see
                   RFC 101: Handlers and Pseudoclasses.

   $file        -  File to open. This might be a real file or directory,
                   but might also be a website, port for a socket,
                   ftp server, ipc pipe, rpc client, and so on.

   @args        -  Optional arguments to pass to the handler's C<open>.

The open function, as I propose it, is an overloaded and extensible function that differs from other constructors in that it returns a valid fileobject. This object can then be used in read, print, and other such file functions. As such, development of classes that can handle objects on new platforms (ex: Mac, Palm) and handle new types of files (XML documents, etc) is much quicker. Plus, these modules are much lighter weight since they don't have to reinvent the wheel every time.

Simple Scalar Form

In the simplest, "looks like Perl 5" form, open() can take one file parameter, which is then opened per the descriptor provided and the corresponding fileobject returned. Here are some examples (note that my has been left out for clarity):

   # Read from a file
   $fo = open "</etc/passwd" or die;
   print while (<$fo>);
   close $fo;

   # Write a file to a pipe
   $mailpipe = open "|/usr/lib/sendmail" or die;
   ($motd, $owner) = open "</etc/motd";       # return owner in list
   die unless $owner == 0;                    # owner not root
   while (<$motd>) {
      print $mailpipe;
   }
   close $motd;

   # Go fork yourself
   ($myself, $pid) = open "-|" or exec 'ls';  # return PID in list
   print while (<$myself>);
   close $myself;   # not myself anymore, hah! ;-)

In addition, the $file argument becomes optional in this new syntax. If not supplied, it defaults to $_, making it consistent with other Perl functions:

   for (@filenames) {
      my $fo = open;      
      push @open_handles, $fo;
   }

   # ... stuff happens ...

   for (@open_handles) {
      close;
   }

Perhaps this specific example is ugly (and useless), but there are probably other situations in which one could take advantage of this. Since $_ is just a string, it can contain a web doc, pipe, file, directory, or anything else you want.

True First-Class FileObjects

One major limitation of Perl's current filehandles is that they are bareword scalars, with no object properties or power. The redesign of simple filehandles into first-class fileobjects allows us to give them full object-oriented power, while still allowing them to be used in a simple manner as shown above. Each object can contain methods to allow us to access features of that fileobject much more efficiently.

Here are some proposed default accessor methods of fileobjects. Each of these would return the appropriate value, or undef if not available. This is a brief listing; the intent would be to support all of the current FileHandle methods and then some. In a string context, the filename as opened (including descriptors like | and <) is returned.

   $fo->STRING    -  Same as $fo->filename
   $fo->NUMBER    -  Same as $fo->fileno

   $fo->filename  -  Name of the file, web document, port, etc
   $fo->fileno    -  System file number
   $fo->type      -  One of 'PIPE', 'FILE', 'Class', etc, ala want()
   $fo->mode      -  Way the file was opened (|,<,>+,etc)
   $fo->dup       -  Returns a duplicate of the current B<fileobject>
   $fo->pid       -  Return current PID of the process (if pipe/fork)

In addition, these functions would allow you to modify key elements of the fileobject:

   $fo->autoflush -  Sets buffer flushing policy
   $fo->untaint   -  Removes tainting from that data source
   $fo->options   -  Some syscalls, like C<socket()>, allow you to set
                     options which affect the handling of C<$fo>

If we decide that fileobjects should be persistent across close() operations, we could define the following functions:

   $fo->open      -  Object methods to open/close C<fileobjects>
   $fo->close
   $fo->is_open   -  1 or undef, depending on the state of the object
   $fo->is_closed

Why would a fileobject be persistent across close() operations? If it contains lots of properties, it may be a waste if we simply want to close it to make sure buffers are flushed or bandwidth is not wasted on TCP connections. We could use close() to flush buffers and tidy things up, but not destroy the object until the end of the script or undef $fo was called (similar to FileHandle).

Extensible Handler Bindings

In addition to the standard file form, open() can also now take an optional handler name (the default is "file") from which to load the appropriate methods. This gives us easy access to methods that open Directory, Socket, HTTP, FTP, or other types of files, meaning we no longer have to start from the ground up every time we want to open a new type of "file".

Handlers are discussed in detail in RFC 101: Handlers and Pseudoclasses. However, since they are relevant to this RFC, we'll digress and discuss them a little.

handlers work much like Apache handlers. You specify a certain type of handler (for example, "file", "dir", "http", "ftp", and so on), which is really just a pseudoclass that inherits its methods from other classes. The only requirement is that the classes the handler inherits from return one of two values:

   1. A valid C<fileobject> which contains certain mandatory
      object methods (exact methods yet to be hammered out).

   2. undef if the file can't be handled, which allows stacked
      handlers.

As such, when a user specifies something like:

   # Open a dir
   $dir = open dir "/usr/bin";

Then the dir handler is called automatically by Perl's standard indirect object notation:

   $dir = dir->open("/usr/bin");

However, dir would not correspond directly to a class, but rather a handler. To paraphrase Tom Hughes's great explanation of this, the core should provide a way for modules to register themselves as being handlers. For example, you might use LWP::UserAgent, which would register as being a valid http handler. Or, you might use Net::FTP, which would register itself as being an ftp handler.

RFC 101: Handlers and Pseudoclasses, explores this in detail. Basically:

   use LWP::UserAgent;
   handler http => 'LWP::UserAgent';   # RFC 101

   $web = open http "www.yahoo.com", 'GET'; 

The handler line gives an example of what might happen. This might be specified by the user (as shown), or might be done automatically by loading modules under a certain part of the lib tree. This mechanism is considered fully in RFC 101.

These handlers are stackable. If a handler? returns undef, then the next one in line tries to open it. If no handler can open a specific file, then undef is returned from open to the user, consistent with current behavior.

For example, we might might stack dir handlers so that only certain users can look at the contents of "/usr/sbin", for example:

   use Unix::OnlyRootSeesUsrSbin;   
   handler dir => 'Unix::OnlyRootSeesUsrSbin';    # RFC 101

   $dir = open dir "/usr/sbin";     # not root? no /usr/sbin for you!
  

Here are some more examples which could be use to provide essentially native file access for many different media:

   # Open a file
   # Note "file" is the default class so doesn't have to be specified
   $motd = open ">/etc/motd" or die;
   print $motd @data;
   close $motd;

   # Open a directory 
   $dir = open dir "/usr/bin";
   @files = grep !/^\..*/, <$dir>;  # no more readdir
   close $dir;                      # no more closedir

   # Open a client socket with IO::Socket
   # By overloading < and > we can do clients and servers!
   $socket = open socket "< 25", PF_INET, SOCK_STREAM, TCP;
   @input = <$socket>;
   close $socket;
   do_something(@input);

   # Open a remote webpage
   $http = open http "www.perl.com", GET;
   @doc = <$http>;
   print @doc if $http->content_type eq 'text/html';
   close $http;

   # Open an ftp connection
   $ftp = open ftp "ftp.perl.com";
   $ftp->cwd('CPAN') or die;
   @files = <$ftp>;          # overloading as dir
   close $ftp;               # $ftp->close

The advantages to using such an extensible open are twofold:

   1. No more reinventing the wheel to support new file
      types on different platforms or networks.

   2. Easy integration so that a single syntax can address files,
      dirs, pipes, websites, ipc communications, sockets, rpc...

Assuming that these handlers all agree to return fileobjects with a consistent set of properties, this could lead to great optimizations since the structure of these fileobjects will be known ahead of time.

Custom File Methods

In addition, handlers can provide a custom set of read, print, and other functions that do different things from the core Perl set. So, a Unix/NT integration module might include a print method that delimits strings differently based on filesystem. Or, a module might provide a custom close method that does special cleanup before closing a file.

This is all accomplished automatically by Perl's indirect object notation. All of a given open module's functions would inherit from the core file methods. For example:

   package MyHttpOpen;
   
   sub open {
       # do stuff...
       return $new_http_fileobject;
   }

   sub print {
       # overrides CORE::print()
       # maybe it PUTs or POSTs data
   }

   sub readline {
       # overrides CORE::readline()
       # maybe it GETs or HEADs data
   }

Then, in your script, you would say:

   use MyHttpOpen;
   handler http => 'MyHttpOpen';

   $post = open http "upload.mydomain.com", POST;
   print $post @filedata;     # calls $post->print
   $reply = <$post>;          # calls $post->readline

In this example, the methods are automatically invoked via the wonderful indirect object syntax. However, to the user it appears just like everything's working normally, with the same set of core functions. It looks like Perl natively supports http protocol documents, but it doesn't have to, yielding great flexibility.

Embedded URI Support

This is something that many other languages, such as PHP, already support, and which would make Perl that much nicer to work with. With our new open, we have the ability to say:

   $htdoc = open "www.yahoo.com" or die;

And have Perl figure out to fork the http handler. This is actually not that hard. All that has to happen is CORE::open needs to read the file and see if it matches a URI pattern, namely:

   ($scheme, $scheme_data) = /^([a-z][a-z0-9.+-]*):(.*)$/i;

And then check to see if there is an existing class handler for that through a simple object call, namely something like:

   if ( $scheme->can("open") )

If the above returns true, then the open string is passed through to the handler to deal with. Otherwise, it is assumed that you want to open a file and the string is passed to the default file handler. Note that using this scheme we could extend URIs to anything we wanted. For example, PHP has a special php:// URI type that specifies certain internal actions. We might choose to specify:

  $process = open "ipc://$parent";   
  $socket  = open "socket://$hostname:$port";

However, this is not required, or even necessarily suggested, but just a demonstration of what could be done.

Note that some web mirroring applications use directories named for URLs. However, this is not difficult to work around. Using either one of these instantly disambiguates this situation:

  $htdoc = open file "www.yahoo.com" or die;
  $htdoc = open "< www.yahoo.com" or die;

The first one would automatically call the file handler via indirect object syntax. For the second one, CORE::open would see that it did not match the syntax for a URI and would pass it to the file handler as well.

Remember, any extra arguments are passed to the handler automatically, so to modify your http access method you would simply say:

  $htdoc = open "www.yahoo.com", 'GET' or die;
  $htdoc = open "www.yahoo.com", 'POST' or die;

Again, simple and quite Perlish.

IMPLEMENTATION

Syntax Changes

The CORE::open function would have to be altered to act more as a simple dispatch than anything. It could simply do some detecting of URIs then turn around and fork file->open for standard files. Or, perhaps the file handler should be builtin to CORE::open still, since this the most common operation.

The CORE::close function would remain basically unchanged, acting on the fileobject (or $_ if none is specified).

There are many other function changes that are somewhat tied to this RFC, but which this RFC does not require strictly, that should be considered. These are covered in detail in RFC 239.

Finally, I am in the process of constructing a Perl 5 prototype of this proposal; I will announce it to the list when it's complete.

Function Deletions

Because this syntax is flexible enough to handle any type of file opening, the following functions should be removed from Perl 6:

   # Replaced by 'open dir'
   opendir         # dir->open   instead
   readdir         # dir->read   instead
   closedir        # dir->close  instead
   seekdir         # dir->seek   instead
   rewinddir       # dir->rewind instead
   telldir         # dir->tell   instead

Notice how the indirect object syntax yields us the ability for function names to overlap, meaning that Perl appears to DWIM embedded file types:

   seek $FH;       # file->seek
   seek $DH;       # dir->seek
   seek $WEB;      # http->seek

When in fact it's a "simple" matter of parsing.

   # Replaced by 'open socket'
   socket          # socket->open
   setsockopt      # socket->options
   connect         # socket->connect
   close           # socket->close
   shutdown        # socket->shutdown   

For more details on these and other proposed changes to IO functions, please see RFC 239: Standardization of Perl IO Functions to use Indirect Objects, since that RFC is what truly addresses these changes.

Performance

In order to prevent performance hits, anything that is packaged as a default file type (such as files, pipes, directories, sockets, ipc, and so on) must be highly optimized for interaction with this new version of open(). Basically, anything living under the IO:: tree should be ripped apart and put back together again.

REFERENCES

RFC 101: Handlers and Pseudoclasses

RFC 239: Standardization of Perl IO Functions to use Indirect Objects

RFC 174: Improved parsing and flexibility of indirect object syntax

RFC 159: True Polymorphic Objects

RFC 33: Eliminate bareword filehandles

RFC 30: STDIN, STDOUT, and STDERR should be renamed

RFC 186: Standard support for opening i/o handles on scalars and arrays-of-scalars

ACKNOWLEDGEMENTS

Tom Christiansen's great analysis of file object methods

Tim Jenness's suggestion to use optimized IO objects for all I/O

Nick Ing-Simmon's suggestion to hide IO:: classes from the user

Tom Hughes's formalization of a way to register open() handlers

Jonathan Scott Duff, Jon Ericson, and Kai Henningsen's help with URIs

Chaim Frenkel's suggestion of a way to register classes in Perl 6

Everyone else on perl6-language-io, you guys are great! Thanks. :-)