When I have to convert data between formats, I reach for Perl. While many people think Perl's built-in regular expressions are what make data munging easy, my experience is that Perl's multi-paradigm nature and dynamic programming flexibility are more important.

The Problem

I help run ClubCompy, a retro-inspired, zero-installation, browser-based programming environment designed to help children learn about computing. One of the reasons they recruited me is to design the educational components, including documentation. (I also know a few things about compilers and business.)

While ClubCompy has a surprising amount of power in its underlying virtual machine, that power is currently exposed in a programming language called Tasty—a mixture of 8-bit BASIC and Logo.

As with most systems that evolve from a simple idea into something else, following the law of opportunism, the project's structure, organization, and tooling have accreted organically instead of following a rigid design. (Startup hackers: your job is to prune things when necessary until you discover the core of your business.) In particular, the documentation for the Tasty language exists in a series of OpenOffice files, one per language keyword.

The good news is that documentation exists. It's mostly complete, too: every keyword has documentation, and most of it is comprehensive. (Maybe 15 or 20% needs expansion, but we'll get there.)

The bad news is that the documentation exists in .odt files. They're not binary blobs, but they don't fit with our publishing system: they're too difficult to convert to clean PDF or very clean HTML for use throughout the system. They're also a mess when checked into source control.

On Monday, I decided to convert them to POD. (ClubCompy uses the Onyx Neon publishing toolchain designed for things like Modern Perl: the book. Everything not yet available on the CPAN is available from my GitHub account.)

Inside ODT Files

An OpenOffice .odt file is a zipped archive of several other files. Fortunately, there's only one file I care about and very fortunately, it's a reasonably self-contained XML file. Getting the contents of content.xml is easy with a little bit of Archive::Zip code:

    use Archive::Zip;

    sub get_xml_contents
    {
        my $file    = shift;
        my $zip     = Archive::Zip->new( $file );
        my $content = $zip->memberNamed( 'content.xml' );
        return $content->contents;
    }
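The function above assumes that the file opens cleanly and that content.xml is present. Here is a minimal sketch of a more defensive variant. The name get_xml_contents_checked() and the error messages are my own inventions for illustration and not part of the original tool.

```perl
use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# Hypothetical variant of get_xml_contents() which reports failures
# instead of calling a method on an undefined value.
sub get_xml_contents_checked
{
    my $file = shift;
    my $zip  = Archive::Zip->new;

    # read() returns AZ_OK when the archive opened and parsed cleanly
    $zip->read( $file ) == AZ_OK
        or die "Cannot read '$file' as a zip archive\n";

    my $member = $zip->memberNamed( 'content.xml' )
        or die "No content.xml in '$file'\n";

    # force scalar context so callers get only the text
    return scalar $member->contents;
}
```

Calling this on something that isn't an OpenOffice document should produce a readable error instead of "Can't call method on an undefined value".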

All of the Tasty keywords follow a standard template for documentation. This is both good and bad. It's good that discovering how OpenOffice represents each unique element in XML is relatively easy: figure it out once and that representation should apply to all files. It's bad that the documentation template didn't use custom semantic styles, like "Top-level Header" and "Program Code".

That means all of the styles are ad hoc:

    <office:automatic-styles>
      <style:style style:name="P1" style:family="paragraph"
                   style:parent-style-name="Standard">
        <style:paragraph-properties fo:background-color="#666699">
          <style:background-image />
        </style:paragraph-properties>
        <style:text-properties fo:color="#ffffff" style:font-name="Calibri"
                               fo:font-size="14pt" fo:font-weight="bold"
                               style:font-size-asian="14pt"
                               style:font-weight-asian="bold"
                               style:font-size-complex="14pt"
                               style:font-weight-complex="bold" />
      </style:style>
      <style:style style:name="P2" style:family="paragraph"
                   style:parent-style-name="Standard"
                   style:master-page-name="">
        <style:paragraph-properties fo:margin-left="0.2602in"
                                    fo:margin-right="0in"
                                    fo:text-indent="0in"
                                    style:auto-text-indent="false"
                                    style:page-number="auto"
                                    fo:background-color="#9999cc">
          <style:background-image />
        </style:paragraph-properties>
        <style:text-properties fo:color="#ffffff" style:font-name="Calibri"
                               fo:font-size="12pt" fo:font-weight="bold"
                               style:font-size-asian="12pt"
                               style:font-weight-asian="bold"
                               style:font-size-complex="12pt"
                               style:font-weight-complex="bold" />
      </style:style>
      ...
    </office:automatic-styles>

I'll explain that more later.

The actual text of each file resembles:

    <office:body>
      <office:text>
        <text:sequence-decls>
          <text:sequence-decl text:display-outline-level="0" text:name="Illustration" />
          <text:sequence-decl text:display-outline-level="0" text:name="Table" />
          <text:sequence-decl text:display-outline-level="0" text:name="Text" />
          <text:sequence-decl text:display-outline-level="0" text:name="Drawing" />
        </text:sequence-decls>
        <text:p text:style-name="P1">Keyword</text:p>
        <text:p text:style-name="P9">WHILE<text:span text:style-name="T1">-DO</text:span> </text:p>
        <text:p text:style-name="P8"> <text:span text:style-name="T2">END</text:span> </text:p>
        ...
      </office:text>
    </office:body>

All of the text of the documentation is available under <text:p> tags.

Extracting Text

Extracting this text is a job for XPath. While I could get more specific with the XPath expression (find all direct children of <office:text>), I went for the simple solution at first:

    use XML::XPath;
    use XML::XPath::XMLParser;

    sub rewrite_xml
    {
        my $contents = shift;
        my $xpath    = XML::XPath->new( xml => $contents );

        set_methods_for_styles( get_xml_style_methods( $xpath ) );
        my $pod = xml_to_pod( $xpath );
        clear_methods_for_styles();

        return $pod;
    }

    sub xml_to_pod
    {
        my $xpath   = shift;
        my $nodeset = $xpath->find( '//text:p' );
        my $pod;

        for my $node ($nodeset->get_nodelist)
        {
            my $style  = $node->getAttribute( 'text:style-name' );
            $style     = 'Empty' if @{ $node->getChildNodes } == 0;
            my $method = get_method_for_style( $style );
            $pod      .= $node->$method;
        }

        return $pod;
    }

Ignore the get_method_for_style() calls for now. The important part of xml_to_pod is that it finds these tags in the XML and performs an action on each of them.

What's that action? Transforming it to POD, of course.

Look in the sample XML again. Each of the paragraphs has an associated text:style-name attribute. That attribute refers to one of the styles declared earlier in that file. Given the name of a style, the body of the loop finds the name of a method and calls that method to transliterate the contents of that tag to POD.
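To make that lookup concrete, here's a minimal sketch of how a style name could turn into a method call. Everything in it is an assumption for illustration: the table contents, the lookup_method_for_style() name, the fallback to plain paragraphs, and the Demo::Node stand-in class. The real mapping comes from the automatic styles in each file.

```perl
use strict;
use warnings;

# Hypothetical style table; the real one is built from the
# <office:automatic-styles> section of each document.
my %method_for_style = (
    P1    => 'toPodForHead1',
    P9    => 'toPodForCodePara',
    Empty => 'toPodForEmpty',
);

# Unknown styles fall back to plain paragraphs (an assumption).
sub lookup_method_for_style
{
    my $style = shift // '';
    return $method_for_style{$style} // 'toPodForPlainPara';
}

# Stand-in for an XML node, with just enough behavior to show the dispatch.
{
    package Demo::Node;

    sub new               { my ($class, $text) = @_; bless { text => $text }, $class }
    sub toPodForHead1     { "=head1 $_[0]{text}\n\n" }
    sub toPodForCodePara  { " $_[0]{text}\n" }
    sub toPodForEmpty     { '' }
    sub toPodForPlainPara { "$_[0]{text}\n\n" }
}

# The method name is only a string until the moment of the call.
my $method = lookup_method_for_style( 'P1' );
print Demo::Node->new( 'Keyword' )->$method;    # prints "=head1 Keyword" and a blank line
```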

Transliterating to POD

Here's where the power of Perl really shines. Every node in that nodeset is an instance of XML::XPath::Node::Element. That class knows nothing about POD. At least, it knows nothing about POD until I declared some methods in it:

    package XML::XPath::Node::Element;

    sub kidsToPod { join '', map { $_->toPod } shift->getChildNodes }

    sub toPod
    {
        my $self   = shift;
        my ($name) = $self->getName =~ /text:(\w+)/;
        my $method = 'toPodFor' . ucfirst $name;

        return $self->$method;
    }

    sub toPodForEmpty { '' }
    sub toPodForS     { ' ' }
    sub toPodForTab   { ' ' }

    sub toPodForSpan
    {
        my $self   = shift;
        my $style  = $self->getAttribute( 'text:style-name' ) // '';
        $style     = 'Empty' if @{ $self->getChildNodes } == 0;
        my $method = main::get_method_for_style( $style );

        return $self->$method;
    }

    sub toPodForBold         { 'B<' . shift->kidsToPod . '>' }
    sub toPodForCode         { 'C<' . shift->kidsToPod . '>' }
    sub toPodForCodePara     { ' ' . shift->kidsToPod . "\n" }
    sub toPodForItalic       { 'I<' . shift->kidsToPod . '>' }
    sub toPodForPlain        { shift->wrapKids( '', '' ) }
    sub toPodForPlainPara    { shift->wrapKids( '', '' ) . "\n\n" }
    sub toPodForBoldCode     { 'C<B<' . shift->kidsToPod . '>>' }
    sub toPodForBoldCodePara { 'C<B<' . shift->kidsToPod . ">>\n" }
    sub toPodForHead0        { shift->wrapKids( '=head0 ', "\n\n" ) }
    sub toPodForHead1        { shift->wrapKids( '=head1 ', "\n\n" ) }
    sub toPodForHead2        { shift->wrapKids( '=head2 ', "\n\n" ) }

    sub wrapKids
    {
        my ($self, $pre, $post) = @_;
        my $kid_text = $self->kidsToPod;

        return '' unless $kid_text;
        return $pre . $kid_text . $post;
    }

Because Perl has open classes, you can add methods to classes (or redefine methods) any time you want. Because Perl has dynamic method dispatch, you can use a string as the name of a method to call.

You can see that this code gets a little bit messy here. That's part and parcel of the tree transformation technique central to compilers; the real world is messy, and that mess has to go somewhere.
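To see open classes and string-based method dispatch in isolation, here's a small sketch that doesn't need any XML at all. Demo::Element and its name() and text() accessors are made up for the example; they stand in for a class that someone else wrote.

```perl
use strict;
use warnings;

# A class defined "elsewhere" which knows nothing about POD.
{
    package Demo::Element;

    sub new  { my ($class, %args) = @_; bless {%args}, $class }
    sub name { $_[0]{name} }
    sub text { $_[0]{text} }
}

# Later, reopen the same package and add methods to it.
{
    package Demo::Element;

    sub toPodForBold   { 'B<' . $_[0]->text . '>' }
    sub toPodForItalic { 'I<' . $_[0]->text . '>' }

    # Build the method name as a string, then call it.
    sub toPod
    {
        my $self   = shift;
        my $method = 'toPodFor' . ucfirst $self->name;

        return $self->$method;
    }
}

my $node = Demo::Element->new( name => 'bold', text => 'WHILE' );
print $node->toPod, "\n";    # prints B<WHILE>
```

If the name has no matching method, Perl dies with a "Can't locate object method" error. During a conversion that's often what you want, because it shows you which element you haven't handled yet.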

The wrapKids() method handles the case where one of these nodes has no textual content but does have a specific style. Given a snippet of documentation like:

    Example 1:

    10 x = 0
    20 WHILE x LT 26 DO
    30 PRINT TOCHAR x + 65
    40 x = x + 1
    50 END
    RUN

    (prints ABCDEFGHIJKLMNOPQRSTUVWXYZ)

... the blank line between RUN and the output is a unique paragraph with the monospace font applied. A naïve output from one of these methods might produce the POD C<> for that line. wrapKids() prevents that.
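The guard is easier to see as a free function. In this sketch, wrap_text() is a hypothetical stand-in for wrapKids(), and its $text argument stands in for whatever kidsToPod() would return. It differs from the method above in one deliberate way: it tests the length of the text instead of its truthiness, so a paragraph containing only "0" still gets wrapped.

```perl
use strict;
use warnings;

# Hypothetical free-function version of the guard in wrapKids().
sub wrap_text
{
    my ($pre, $text, $post) = @_;

    # emit nothing at all for empty content, not an empty wrapper
    return '' unless defined $text && length $text;
    return $pre . $text . $post;
}

print wrap_text( 'C<', 'RUN', '>' ), "\n";    # prints C<RUN>
print wrap_text( 'C<', '',    '>' ), "\n";    # prints an empty line, not C<>
print wrap_text( 'C<', '0',   '>' ), "\n";    # prints C<0>
```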

This open class approach works very well. It also scales well as the conversion grows more complex. Even if this code eventually migrates to build a POD document model (see Pod::PseudoPod::DOM), giving individual nodes the responsibility of emitting a tree or text moves the custom behavior to where it most belongs.

(The benefit of a DOM is that basic tree transformation rules can take care of pruning out unnecessary elements, such as the blank code line.)

The Little Details

XML::XPath::Node::Element objects may nest, but you can see how that nesting works just fine through the toPod() method. Those Element objects may also contain XML::XPath::Node::Text instances as children. These objects represent plain text.

So far, I've only found one situation where this plain text needs any manipulation. Adding one method fixes this:

    package XML::XPath::Node::Text;

    use HTML::Entities;

    sub toPod
    {
        # decode XML entities such as &lt; before escaping for POD
        my $raw_text = HTML::Entities::decode_entities( shift->toString );
        return main::encode_pod( $raw_text );
    }

The encode_pod() function (it's in main so as not to make it available as a method inadvertently) is:

    use feature 'state';
    use Regexp::Assemble;

    my %escapes =
    (
        '<' => 'E<lt>',
        '>' => 'E<gt>',
    );

    sub encode_pod
    {
        # build the regex once, on the first call
        state $replace = make_regexp( \%escapes );
        my $text       = shift;

        $text =~ s/($replace)/$escapes{$1}/g;

        return $text;
    }

    sub make_regexp
    {
        my $escapes = shift;
        my $ra      = Regexp::Assemble->new;

        $ra->add( $_ ) for keys %$escapes;

        return $ra->re;
    }

More robust solutions exist, but so far this is all I've needed.
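With only two characters to escape, the same substitution also works without Regexp::Assemble. This is a sketch of that alternative, not the code the converter uses; encode_pod_simple() is a made-up name and the sample input is invented.

```perl
use strict;
use warnings;

my %escapes = ( '<' => 'E<lt>', '>' => 'E<gt>' );

# Core-only stand-in for the Regexp::Assemble version: quote each key
# and join them into a single alternation.
my $pattern = join '|', map { quotemeta } sort keys %escapes;

sub encode_pod_simple
{
    my $text = shift;
    $text =~ s/($pattern)/$escapes{$1}/g;
    return $text;
}

print encode_pod_simple( 'IF x < 10 AND y > 2' ), "\n";
# prints IF x E<lt> 10 AND y E<gt> 2
```

Regexp::Assemble earns its keep when the list of escapes grows, because it combines many alternatives into an efficient pattern for you.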

I do admit that the implementation is a little messy in places. That's one of the problems with this compiler technique: sometimes you have data that needs to be available everywhere but you don't want to pass it as arguments everywhere and you don't want to wrap up everything in intermediary objects because you're already using perfectly good objects from elsewhere.

I haven't shown the code which identifies styles and makes the hash of style name to output method yet; that's for the next post. I'm sure you can start to figure out how it works already.