Independently Parsing Perl

A few years into my programming career, I found myself involved in a somewhat unusual web project for an enormous global IT company. Due to some odd platform issues, we could write the intranet half of the project only in Perl and the almost-identical public internet half only in Java.

In my efforts to pick up enough Java to help my Perl code interoperate with the code from the Java guys, I stumbled on a relatively new editor with the rather expansive name of JetBrains IntelliJ IDEA.

What a joy! It quite simply made learning Java an absolute pleasure, with comprehensive tab completion, light and simple API docs, easy exploration of the mountain of Java classes, and unobtrusive indicators showing me my mistakes and offering to fix them. In short, it had lots of brains and a fast, clean user interface.

Where Is IntelliPerl?

Although I only needed it heavily for a few months, it’s been my gold standard ever since, and my definition of what a “great” editor should be. I install every new Perl editor and IDE I come across in the hope that Perl might one day get an editor half as good as what Java has had for years.

These great editors are spreading. Java is now up to one and a half (Eclipse is nearly great but still seems not quite “effortless” enough about what it does). Dreamweaver gave HTML people their great editor years ago, and I’ve heard that Python may now have something that qualifies.

Interestingly, these great editors seem to share one major thing in common.

How to Build a Great Editor

Rather than relying on the language’s parser to examine code, great editors seem to implement special parsers of their own. These parsers treat a file less like code and more like a generic document (that just also happens to be code).

It’s a key distinction, and one that provides two critical capabilities.

First, it creates a “round-trip” capability, parsing a file into an internal model and back out again without moving a single white space character out of place. Even if parts of a file are broken or badly formatted, you can still change other parts of the file and save it correctly without it changing anything you don’t alter.

Second, it makes the parser extremely safe and error-tolerant. Any code open in an editor is there for a reason–generally because it isn’t finished yet, is broken, or needs changing. A document parser can hit a problem, flag it, stumble for a character or so until it finds something it recognizes, and then continue on.

Parsing as code is an entirely different task, and one often unsuited to these type of faults.

For example, take the following.

print "Hello World!

"; } MyModule->foobar;

For an editor using Perl itself to understand this code, it’s game over once it hits the naked closing brace, because the code is invalid. Without knowledge of what is below the brace, you lose all of the intelligence that needs the parser: syntax highlighting, module checking, helpful tips, the lot.

It’s just simply not a reasonable way to build an editor, where a file can be both unfinished and have dozens of bugs.

Building a Document Parser for Perl

Even without an editor to put it in (yet), a document parser for Perl would be extraordinarily useful for all sorts of tasks. At the time, though, all I really wanted was a really accurate HTML syntax highlighter.

Some time in early 2002, I was bored one afternoon and had a first stab at the problem. The result was pretty predictable, given patterns I’ve seen in others trying the same thing. It was A) based on regular expressions, and B) useless for anything even remotely interesting.

Between then and the start of The Perl Foundation grant in December 2004, I’ve spent a day or so a month on the problem, rewriting and throwing away code. I’ve junked two tokenizers, one lexer, an analysis package, three syntax highlighters, an obfuscation package, a quote engine, and half of the classes in the current object tree.

Now, finally, PPI is complete, bar some minor features and testing. It is 100 percent round-trip safe, and it’s been stress tested against the 38,000 (non-Acme) modules in CPAN, handling all but 28 of the most broken and bizarre.

What Does It Do?

PPI should be the basis for any task where you need to parse, analyze or manipulate Perl, and it finally provides a platform for doing these tasks to their full potential. This covers a huge range of possible tasks; far too many to cover in any depth here.

For this article, I want to demonstrate how PPI can improve existing tools that currently only do a very basic job, when there is the potential for so much more.

One of these is part of the PAR application-packaging module. When PAR bundles a module into the internal include directory, it tries to reduce the size of the modules by stripping out POD. Of course, what would be better would be to strip out everything that is excess and cut PAR file sizes even more.

This is a form of compression, but given the potential confusion in using something like “Compress::Perl” as a name, I’m picking my own term. I hereby anoint the term “Squish”. A squished module occupies as little space as possible, having had redundant characters removed. It will be extremely small, although it might look a little “squished” to look at :)

Perl::Squish

Rather than showing you the final project, I prefer to show the process of squishing a single module.

# Where is File::Spec on our system? use Class::Inspector; my $filename = Class::Inspector->resolved_filename( 'File::Spec' ); # Load File::Spec as a document use PPI; my $Document = PPI::Document->new( $filename );

Everything you do with PPI starts and finishes with PPI::Document objects. If you find yourself using the lexer directly, you are probably doing something wrong.

Where can I start cutting out the fat? For starters, many core modules have an __END__ section.

# Get the (one and only) __END__ section my $End = $Document->find_first( 'Statement::End' ); # Delete it from the document $End->delete if $End;

PPI provides a set of search methods that you can use on any element that has children. find_first is a safe guess, because there can only be one __END__ section. The search methods actually take &wanted functions like File::Find, so 'Statement::End' is really syntactic sugar for:

sub wanted { my ($Document, $Element) = @_; $Element->isa('PPI::Statement::End'); }

Of course, there’s a faster way to do the same thing. The prune method finds and immediately deletes all elements that match a particular condition.

# Delete all comments and POD $Document->prune( 'Token::Pod' ); $Document->prune( 'Token::Comment' );

For a more serious example, here’s how to strip the non-compulsory braces from ->method() :

# Remove useless braces $Document->prune( sub { my $Braces = $_[1]; $Braces->isa('PPI::Structure::List') or return ''; $Braces->children == 0 or return ''; my $Method = $Braces->sprevious_sibling or return ''; $Method->isa('PPI::Token::Word') or return ''; $Method->content !~ /:/ or return ''; my $Operator = $Method->sprevious_sibling or return ''; $Operator->isa('PPI::Token::Operator') or return ''; $Operator->content eq '->' or return ''; return 1; } );

It’s a little bit wordy, but is relatively straightforward to write. Just add conditions and discard as you go. You can get other elements, calculate anything or call sub-searches.

When you have finished, be sure to save the file.

# Save the file $Document->save( "$filename.squish" );

Wrapping It All Up

All you need to do now is is wrap it all up in some typical module boilerplate.

package Perl::Squish; use strict; use PPI; our $VERSION = '0.01'; # Squish a file in place # Perl::Squish->file( $filename ) sub file { my ($class, $file) = @_; my $Document = PPI::Document->new( $file ) or return undef; $class->document( $Document ) or return undef; $Document->save( $file ); } # Squish a document object # Perl::Squish->document( $Document ); sub document { my ($squish, $Document) = @_; # Remove the stuff we did earlier $Document->prune('Statement::End'); $Document->prune('Token::Comment'); $Document->prune('Token::Pod'); $Document->prune( sub { my $Braces = $_[1]; $Braces->isa('PPI::Structure::List') or return ''; $Braces->elements == 0 or return ''; my $Method = $Braces->sprevious_sibling or return ''; $Method->isa('PPI::Token::Word') or return ''; $Method->content !~ /:/ or return ''; my $Operator = $Method->sprevious_sibling or return ''; $Operator->isa('PPI::Token::Operator') or return ''; $Operator->content eq '->' or return ''; return 1; } ); # Let's also do some whitespace cleanup my @whitespace = $Document->find('Token::Whitespace'); foreach ( @whitespace ) { $_->{content} = $_->{content} =~ /

/ ? "

" : " "; } 1; } 1;

Finally, the last step is to wrap it all up as a proper module. You can see the finished product prettied up with PPI’s syntax highlighter at CPAN::Squish. I’ve added a few additional small features to the basic code described above, but you get the idea. See also Perl::Squish for more details.

In 15 minutes, I’ve knocked together a pretty simple module that dramatically improves on what you could do without something like PPI. Now imagine the hard things it makes possible.