Project Gutenberg Projects

Convert a treasure trove of Project Gutenberg novels to Markdown in preparation for thematic typesetting.

Introduction

I hope to give readers a thorough understanding of where presentation logic can creep into content and the ways that it can be avoided. If you’d rather dig into delicious programming bits, see the XHTML to Markdown section.

Project Gutenberg is so notoriously inconsistent that it could use a trigger warning for anyone who has an obsessive-compulsive personality disorder (ahem). This inconsistency means that automatic typesetting—reaching professional level—would be extraordinarily tedious, if not altogether impossible. That laborious overhead prevents showcasing some popular fiction in a variety of beautiful forms with great ease.

Fortunately, others have engaged in projects to ease typesetting prose from Project Gutenberg. Notably, these include the HTML Writers Guild and Standard Ebooks. Both come with their own set of technical challenges, so let’s explore some of the technical quagmires that caused clear losses in the battle of content versus presentation.

Normalization

A number of attempts were made to normalize—or suggest formats for normalizing—Project Gutenberg over the years.

Of all these projects, the most amenable to automatic typesetting are those produced by Standard Ebooks and HTML Writers Guild. The benefit of using HTML Writers Guild is their semantic markup and simple document type definition (DTD) file. Standard Ebooks, as the name suggests, are brilliantly standardized and have an excellent Manual of Style that describes what to expect from the XHTML.

At time of writing, the official XML documents listed on the Guild’s website are mostly unavailable. Thankfully, the Internet Archive, a random GitHub page, and a GitLab page contain copies of their work.

HTML Writers Guild

As great as they are, the HTML Writers Guild’s XML documents couple words to style in a few ways. This section describes the problems and proposes how to solve them. The issues are presented as shown in the original marked-up file, with only the document’s XML structure indented for readability.

By Lines and Subtitles

The “by line” is what I call the line that introduces the author’s name. Here we can spot a few problems:

<frontmatter>
  <titlepage>
    <title>OLIVER TWIST</title>
    <para> OR</para>
    <subtitle> THE PARISH BOY'S PROGRESS</subtitle>
    <para>BY </para>
    <author>CHARLES DICKENS</author>
  </titlepage>
</frontmatter>

First, all element texts are capitalized. Although converting to Title Case is straightforward, formal names have edge conditions—such as possessives, van Dykes, or McLeods—that take some care. It is easier for a computer to convert mixed case to uppercase than the other way around. Arguably, how the case is presented is, well, presentation logic.
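These edge conditions are easy to underestimate. A minimal Python sketch of the conversion; the `title_case` helper and its particle list are illustrative assumptions, not a complete solution:

```python
# Hypothetical title-casing helper; the particle list is a tiny,
# illustrative sample and nowhere near exhaustive.
LOWERCASE_PARTICLES = {"van", "de", "von", "der"}

def title_case(text: str) -> str:
    """Convert an all-uppercase name to mixed case, minding edge cases."""
    words = []
    for word in text.lower().split():
        if word in LOWERCASE_PARTICLES:
            words.append(word)  # "van dyke" stays "van Dyke"
        elif word.startswith("mc") and len(word) > 2:
            words.append("Mc" + word[2:].capitalize())  # "mcleod" -> "McLeod"
        else:
            # str.capitalize leaves possessives intact: "boy's" -> "Boy's".
            words.append(word.capitalize())
    return " ".join(words)

print(title_case("THE PARISH BOY'S PROGRESS"))  # The Parish Boy's Progress
print(title_case("HENRY VAN DYKE"))             # Henry van Dyke
```

Going the other direction, from mixed case to uppercase, needs none of these special cases.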

Second, there’s a linear organization for the title page, signified by the OR and BY paragraphs, which smacks of presentation presumptions. Admittedly, the para elements can be ignored, but that leaves an unsettling feeling: we can’t know whether an automated process will miss a key phrase tucked into a para above or below a semantic element. Another way to mark up the document would be to forbid para elements within the titlepage element. Using an attribute—like alt for “alternative title”—to suggest the meaning of the subtitle then allows the presentation layer to insert the OR if desired:

<title>OLIVER TWIST</title>
<subtitle alt="true">THE PARISH BOY'S PROGRESS</subtitle>
<author>CHARLES DICKENS</author>

Third, BY by itself is redundant (we all know that authors write books), repetitious (yes, all books), presentational, and not metadata. Whether the author’s name is introduced with “by” is logic that belongs outside the book’s content.

The XML fragment with the problems resolved resembles:

<frontmatter>
  <titlepage>
    <title>Oliver Twist</title>
    <subtitle alt="true">The Parish Boy's Progress</subtitle>
    <author>Charles Dickens</author>
  </titlepage>
</frontmatter>

Page Numbers

In a printed book or preformatted eBook, page numbers are incredibly useful. Within a plain text file, however, page numbers interfere with automatic typesetting because factors that affect the page count—page dimensions, font sizes, chapter sink, and margins—are not yet realized.

Some of the texts embed page numbers within the text, such as the following snippet from Moby Dick:

reveries. Some leaning against the spiles; some seated upon the
pier-heads; some looking over the bulwarks glasses! ..
<p 2 >
of ships from China; some high aloft in the rigging, as if striving to
get a still better seaward peep. But these are all landsmen; of week
days pent up

Now we have three problems, grammar notwithstanding. First, there’s no way to tell whether the text after the page number should join with the text before the page number, such as at a paragraph boundary. Second, while computers are exceptional at counting, humans will make many data entry errors, such as double-counted pages:

.. <p 104 >
Mr. Flask --good-bye, and good luck to ye all --and this day three years I'll
<!-- skipped for brevity -->
heavy-hearted cheers, and blindly plunged like fate into the lone Atlantic.
.. <p 104 >

Third, the page numbers themselves were formatted inconsistently, which would have to be taken into account when writing a regular expression to eliminate the numbers and join the ~566 paragraphs together:

.. <p 109n. >
See subsequent chapters for something more on this head.
.. <p 110n. >
See subsequent chapters for something more on this head.
.. <p 110 >
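A sketch of such an expression in Python; the pattern below is an assumption reverse-engineered from the markers shown above and would need testing against the whole file:

```python
import re

text = (
    "some looking over the bulwarks\n"
    "<p 2 >\n"
    "of ships from China; some high aloft in the rigging.\n"
    "<p 110n. >\n"
    "See subsequent chapters for something more on this head."
)

# Matches "<p 2 >", "<p 104 >", and footnote-style "<p 110n. >" markers,
# consuming the surrounding newlines so the text joins cleanly.
PAGE_MARKER = re.compile(r"\n?<p\s+\d+n?\.?\s*>\n?")

joined = PAGE_MARKER.sub(" ", text)
print(joined)
```

Whether the joined text should really be one paragraph is exactly the ambiguity described above; no regular expression can recover that intent.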

Breathe. Remember to breathe. Moby Dick cannot be typeset automatically without extensive edits by a human, such as those made to produce the Standard Ebooks version.

Table of Contents

A nice feature included in the XML versions is that the tables of contents have been normalized in toc elements like the following:

<toc>
  <title>CONTENTS</title>
  <subtitle>Book the First--Recalled to Life</subtitle>
  <item>Chapter I The Period</item>

You know where this is going, though: all of the toc elements contain presentation logic that is also duplicated within the text. The following markup within the text body sheds light on the subtitle / title repetition:

<bookbody>
  <part>
    <titlepage>
      <title>Book the First--Recalled to Life</title>
    </titlepage>

For our purposes, we’ll ignore the toc element because the typesetting engine will recreate it automatically from the chapter headings.

Chapter Numbers

Chapters in these files resemble the following:

<chapheader>
  <chapnum>I</chapnum>
  <title>The Period</title>
</chapheader>

Computers really do excel at counting. Whether to use Roman, Arabic, or Egyptian numerals is a design decision. We can safely ignore the chapnum element.

The DTD could be changed to suggest a numeral style that captures how the original publication was printed, which would cleanly separate concerns in a machine-readable fashion:

<chapheader numeral="roman">
  <title>The Period</title>
</chapheader>
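Since the numerals can be regenerated at presentation time, dropping chapnum loses nothing. A small Python sketch of the classic conversion:

```python
def to_roman(n: int) -> str:
    """Render a positive integer as a Roman numeral."""
    numerals = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    result = []
    for value, symbol in numerals:
        while n >= value:
            result.append(symbol)
            n -= value
    return "".join(result)

# The presentation layer picks the style; the content only needs order.
print(to_roman(1))   # I
print(to_roman(42))  # XLII
```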

Capitalization

In Huckleberry Finn, Robinson Crusoe, The Red Badge of Courage, and other texts, sometimes the first word of a chapter was entered in uppercase. Sometimes words within paragraphs were added in uppercase for emphasis, like “through” in Tom Sawyer or A Tale of Two Cities.

In the latter case, modern typesetting would prefer to use italics or bold to make the words stand out. In the former, it is the job of the presentation layer, be it cascading style sheets, ConTeXt setups, LaTeX packages, or custom SILE extensions.

To clarify with an example from Tom Sawyer:

<para> SATURDAY morning was come, and all

Becomes:

<para> Saturday morning was come, and all
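That conversion can be automated for the common case. A hedged Python sketch, assuming only a single all-caps opening word needs fixing (some chapters shout several words, which this ignores):

```python
import re

def normalize_lead(paragraph: str) -> str:
    """Lowercase an all-caps opening word, keeping its initial capital."""
    return re.sub(
        r"^([A-Z])([A-Z]+)\b",
        lambda m: m.group(1) + m.group(2).lower(),
        paragraph,
    )

print(normalize_lead("SATURDAY morning was come, and all"))
# Saturday morning was come, and all
```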

Paragraphs

When writing in plain text, applying word wrap, line breaks, and paragraph breaks consistently can be difficult for the uninitiated. For the most part, the HTML Writers Guild made wonderfully consistent and machine-parsable paragraphs. Inside The Insidious Dr. Fu Manchu, however, are lines that cannot be parsed into paragraphs:

<para>

"That will do," said Smith, and I thought I detected a note of triumph

in his voice. "But stay! Take us through to the back of the house."

</para>

Fortunately, it looks like the entire file has been double-spaced consistently, so it would be simple enough to fix with the following regular expression applied using vim:

:%s/\n\n/\r/g

This forces all lines in a paragraph to immediately follow each other without an intermediary blank line:

<para>
"That will do," said Smith, and I thought I detected a note of triumph
in his voice. "But stay! Take us through to the back of the house."
</para>
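Outside of vim, the same join can be sketched in Python, assuming the file really is consistently double-spaced as described:

```python
import re

double_spaced = (
    "<para>\n"
    "\n"
    '"That will do," said Smith, and I thought I detected\n'
    "\n"
    "a note of triumph in his voice.\n"
    "\n"
    "</para>\n"
)

# Collapse each blank line so a paragraph's lines follow one another,
# mirroring the vim substitution.
single_spaced = re.sub(r"\n\n", "\n", double_spaced)
print(single_spaced)
```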

Moby Dick suffers from this affliction inconsistently. Moreover, multiple paragraphs are embedded within a single para element, in violation of the one paragraph per para element rule. Once again, the Standard Ebooks version provides a cleaner semantic markup:

<p>“So it is, so it is; if we get it.”</p>
<p>“I was speaking of the oil in the hold, sir.”</p>

Quotes

Using entities (such as paired left- and right-double quotes) allows complex nested quotes to be typeset unambiguously:

<para>
“Violet said, ‘Rose yelled, “I'm cybed!” in elation,’” said Redd.
</para>

This would produce:

“Violet said, ‘Rose yelled, “I’m cybed!” in elation,’” said Redd.

Most of the texts embed curly quotes directly into the text.

We could get a jump on burgeoning commodity text-to-speech (TTS) software by marking document speech as follows:

<para>
<q s="Redd">Violet said, <q s="Violet">Rose yelled,
<q s="Rose">I'm cybed!</q> in elation,</q></q> said Redd.
</para>

Here q means quote and s means speaker (both abbreviated to reduce repetitive strain injuries). This eliminates ambiguity, eliminates obscure entities, is machine-readable, and enables TTS engines to change voices appropriately. Notice that because the quotes are nested, whether the TTS switches voices within nesting can be decided when exporting to audio.

It’s also extensible, meaning that expressiveness can be added if desired:

<q s="Rose" e="joy">I'm cybed!</q>
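The structure is trivially machine-readable. A Python sketch (the `speakers` walker is hypothetical) extracts each speaker and its nesting depth, which is all a TTS engine needs to assign voices:

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    '<para><q s="Redd">Violet said, <q s="Violet">Rose yelled, '
    '<q s="Rose">I\'m cybed!</q> in elation,</q></q> said Redd.</para>'
)

def speakers(element, depth=0):
    """Walk nested q elements, yielding (speaker, nesting depth) pairs."""
    for q in element.findall("q"):
        yield q.get("s"), depth
        yield from speakers(q, depth + 1)

for speaker, depth in speakers(para):
    print(" " * depth + speaker)
# Redd
#  Violet
#   Rose
```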

Spacing

Quite often, especially in poetry, people use spacing to signify markup. This happens even when semantic markup exists to separate poetic forms from the prose. For example, Call of the Wild uses:

<poem>
  <line> "Old longings nomadic leap,</line>
  <line> Chafing at custom's chain;</line>
  <line> Again from its brumal sleep</line>
  <line> Wakens the ferine strain."</line>
</poem>

Prester John obfuscates the markup to make the poem easier to read (or perhaps edit):

<poem><verse><line> 'Diving as if condemned to lave</line><line>
 Some demon's subterranean cave,</line><line>
 Who, prisoned by enchanter's spell,</line><line>
 Shakes the dark rock with groan and yell.' </line></verse></poem>

Adventures of Robin Hood provides a clever twist where the indentation is given outside of the lines to indent—transformation engines will ignore the whitespace by default:

<song>
  <verse>
    <line>"_In peascod time, when hound to horn</line>
    <line>Gives ear till buck be killed,</line>
    <line>And little lads with pipes of corn</line>
    <line>Sit keeping beasts afield_--"</line>
  </verse>
</song>

Treasure Island attempts to mark lines of a poem’s verse with indent3 or indent6, but these classes are extraneous and repetitious:

<poem>
  <verse>
    <line class="indent3"> If sailor tales to sailor tunes,</line>
    <line class="indent6"> Storm and adventure, heat and cold,</line>

The whitespace can be removed, but for the cases where recreating the poem’s form would be laborious to codify, a special syntax is needed:

<poem type="ekphrastic"> </poem>

TEI’s poetry markup is comprehensive and a good source of ideas, but too verbose for simple poems found in fiction novels.

Section Breaks

Last on the list are section breaks. In professionally typeset novels, ornate illustrations can sometimes replace manuscript asterisks (* * *). In Call of the Wild, we find:

<para> * * * </para>

Preferably, a semantic section break would be useful, such as:

<section-break />

Or even one of these lesser-known, archaic, perilous, substandard tags:

<br class="section" /> <hr />
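A cleanup pass could promote the asterisks to the semantic form. A Python sketch, assuming the breaks look exactly like the Call of the Wild example:

```python
import re

xml = (
    "<para>Some prose before the break.</para>\n"
    "<para> * * * </para>\n"
    "<para>Some prose after the break.</para>"
)

# Rewrite manuscript asterisks as a semantic, self-closing break element.
normalized = re.sub(
    r"<para>\s*\*\s*\*\s*\*\s*</para>",
    "<section-break />",
    xml,
)
print(normalized)
```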

GITenberg

GITenberg has the same goal as Standard Ebooks. The main difference is that GITenberg aims to use AsciiDoc. While this is a step forward, there appears to be little attempt at giving a deeper semantic meaning to the prose. Other issues:

Text includes page numbers.

Class names bereft of meaning.

Inconsistent use of header tags (especially h4).

Tables used for formatting.

Mixes content with presentation.

No distinction between salutations and valedictions.

Posted letters are not easily distinguished from prose.

Poems or verses are not marked as such.

Possible use of br to replicate pagination.

While GITenberg is a marked improvement over Project Gutenberg, it would be rather arduous to typeset its novels automatically.

Standard Ebooks

Standard Ebooks are superior to the HTML Writers Guild books in many ways. Additionally, they appear to be kept up-to-date with a growing library of classics. With respect to automatic typesetting, here are some issues (none of which are insurmountable):

Separate files. The chapters and metadata are in separate XML files, which must be recombined. This can be accomplished either using their epub tools or XSLT.

Unicode characters. Characters such as hair spaces embed presentation within the prose and must be removed, because typesetting engines often have their own rules for typesetting hair spaces, em dashes, ellipses, and similar.

Blockquotes. Not all XHTML blockquote elements are classified, which makes detecting extended quotations a chore, and therefore difficult to prefix with >.

Victory

Given the extraordinary consistency and detailed attention to modern typography, typesetting Standard Ebooks will produce the most aesthetically pleasing results.

XHTML to Markdown

There are a number of steps necessary to convert Standard Ebooks to Markdown. Broadly, these include:

1. Download the book.
2. Read the metadata file.
3. Extract the title and author.
4. Concatenate the chapters sequentially.
5. Export each formatted chapter.

Even though ConTeXt can typeset XML documents, we’ll use XSLT—the verbose language only gurus grok without gripes—to convert XHTML into a Markdown document that pandoc can read to produce a native ConTeXt file.

Requirements

Download and install the following tools before beginning: a Java runtime and the Saxon-HE XSLT processor.

Once installed, set an environment variable named SAXON_JAR to the fully qualified path (directory plus file name) for saxon-he-10.0.jar. Substitute the version of the software that was downloaded, if different.

Ensure the XSLT processor can run before continuing:

java -jar $SAXON_JAR

Download a Book

Once the requirements are met, open a terminal then run the following commands to download Jane Austen’s Pride and Prejudice:

mkdir -p $HOME/dev/writing/book/novels
cd $HOME/dev/writing/book/novels
git clone \
  https://github.com/standardebooks/jane-austen_pride-and-prejudice

The novel is downloaded.

Read Metadata

Create a new file named se2md.xsl (meaning an extensible stylesheet for transforming Standard Ebook to Markdown) that contains the following:

<?xml version="1.0"?>
<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/">
    <xsl:text>hello, world
</xsl:text>
  </xsl:template>
</xsl:stylesheet>

We’ll refer to the file as the stylesheet and the file to parse ( content.opf ) as the source document. Although the metadata files are always named content.opf, the epub specification defines the file name in ./src/epub/META-INF/container.xml. If Standard Ebooks ever renames the metadata file, the stylesheets will fail. Deriving the name from container.xml would be more robust: an adventure that is all yours.
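For anyone taking up that adventure, a Python sketch of the derivation (the `opf_path` helper is hypothetical; the namespace is the standard epub container namespace):

```python
import xml.etree.ElementTree as ET

NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

def opf_path(container_xml: str) -> str:
    """Return the package document path declared in META-INF/container.xml."""
    root = ET.fromstring(container_xml)
    rootfile = root.find("c:rootfiles/c:rootfile", NS)
    return rootfile.get("full-path")

container = """<?xml version="1.0"?>
<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="epub/content.opf"
        media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

print(opf_path(container))  # epub/content.opf
```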

Confirm that the source document can be opened by running the XSLT processor with the stylesheet ( -xsl: ) and source document ( -s: ):

cd $HOME/dev/writing/book/novels
java -jar $SAXON_JAR \
  -xsl:se2md.xsl \
  -s:jane-austen_pride-and-prejudice/src/epub/content.opf

If the transformation worked, you should see:

hello, world

The metadata has been read by the XSLT processor, even though the stylesheet makes no use of it.

Title and Author

Replace the contents of se2md.xsl with:

<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nl "&#10;">
]>
<xsl:stylesheet version="3.0"
  xmlns:opf="http://www.idpf.org/2007/opf"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/ | opf:package | opf:metadata">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">
    <xsl:text>::: </xsl:text>
    <xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>
    <xsl:apply-templates /><xsl:text>&nl;</xsl:text>
    <xsl:text>:::&nl;&nl;</xsl:text>
  </xsl:template>

  <xsl:strip-space elements="*" />
  <xsl:template match="*" />
</xsl:stylesheet>

Run the same command to invoke the XSLT processor as before. The output should now resemble:

::: title
Pride and Prejudice
:::

::: author
Jane Austen
:::

A few key lines could use explaining, the first being:

<xsl:template match="/ | opf:package | opf:metadata">

As the XSLT processor reads content.opf , the match attribute instructs the processor to look for any of the following elements:

/ – The root element, which comes before all elements.

opf:package – Matches the package element within the opf XML namespace. Open content.opf in a plain text editor and notice how package declares its XML namespace using xmlns="http://www.idpf.org/2007/opf". In our stylesheet, that same namespace is declared with an opf prefix. When the XSLT processor detects a package element in the opf namespace, the match criteria in our stylesheet fires and the contents of the xsl:template are executed.

opf:metadata – We also have to match the metadata element (in the same namespace) because that’s where the title and author elements are nested.

We use <xsl:apply-templates /> to tell the XSLT processor to continue matching and applying additional stylesheets, recursively, as it reads through the source file’s nested hierarchy.

The next line of interest is similar to the previous template:

<xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">

Like before, our stylesheet defines the dc namespace to be the same as the dc namespace declared inside the source document. This allows us to match both dc:title and dc:creator elements. We further specify the criteria by using an id attribute to home in on the exact value we want to include in the output document.

Another notable line follows:

<xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>

Upon matching the id attribute value of either title or author, we write said attribute value into the output document verbatim. The only issue here is that if the order of dc:title and dc:creator is swapped inside content.opf, then the output document will be incorrectly ordered.

Export Chapters

If any part of the implementation could be considered fun, this would be it. Let’s break down the overall steps we want to accomplish:

1. Combine all the chapters.
2. For each section, export its heading.
3. Transform all relevant XHTML elements into Markdown.
4. Map Unicode characters to Markdown equivalents.

The first step is accomplished using the following dense snippet:

<xsl:template match="opf:manifest">
  <xsl:variable name="book">
    <book>
      <xsl:copy-of select="document(
        opf:item[
          @media-type='application/xhtml+xml' and
          substring( @id, 0, 8 )='chapter']/@href, . )
        /h:html/h:body/h:section" />
    </book>
  </xsl:variable>
  <xsl:apply-templates select="$book" />
</xsl:template>

This creates a variable named book that contains the following overall XML structure for all concatenated chapters:

<book>
  <section epub:type="volume">
    <section epub:type="part">
      <section epub:type="chapter">
        <p>First section's text.</p>
      </section>
    </section>
  </section>
  <section epub:type="volume">
    <section epub:type="part">
      <section epub:type="chapter">
        <p>Second section's text.</p>
      </section>
    </section>
  </section>
</book>

Without wrapping the book element around each chapter’s enclosing section element, there would be no easy way to detect whether a section has a preceding section. We need to check for preceding sections to determine whether the volume or part for a particular chapter is a continuation of the previous volume or part.

Next up is this beast:

<xsl:copy-of select="document(
  opf:item[
    @media-type='application/xhtml+xml' and
    substring( @id, 0, 8 )='chapter']/@href, . )
  /h:html/h:body/h:section" />

Reading content.opf reveals its structure with respect to item elements:

<package>
  <metadata>
  <manifest>
    <item href="text/chapter" id="chapter" media-type="...">

The template declaration ( <xsl:template match="opf:manifest"> ) matches on manifest (in the “opf” namespace), which provides local access to its nested item elements (in the same namespace). We want to extract the href attribute from all item elements to get the relative path to each chapter’s file. After getting its relative path, we want to read that chapter’s XHTML content. Therefore:

copy-of – creates a deep copy of whatever was selected, verbatim;

document( – calls the document function to read an XML file;

opf:item[ – matches opf:item elements meeting specific criteria;

@media-type='application/xhtml+xml' – focuses on opf:item elements that are marked as XML documents;

and substring( @id, 0, 8 )='chapter'] – and have an id attribute beginning with the word chapter (see aside, below);

/@href – extracts the href attribute from the opf:item, for example text/chapter-1.xml, which is passed into the document function;

. – instructs the XSLT processor to use a relative path from the XML document’s directory when reading the files; and

/h:html/h:body/h:section – means to discard the html and body elements from the XHTML document, returning only the section element for copy-of to include.

As an aside, Standard Ebooks does not have a machine-readable way to tell chapter files apart from other file types. We fudge it by checking that the @id attribute of each item in the manifest starts with chapter. String comparisons are almost always brittle solutions in software development because there is no guaranteed contract that defines a small, finite set of possible values that everyone agrees upon. (For example, if a group of editors translated the books into French, they could prefix the chapter files with chapitre instead, which would break the stylesheet’s code.) Ideally, each item element would be classified with a value that could be used to distinguish chapter files from supplementary files.
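To make the brittleness concrete, the same check can be expressed as a Python sketch against a toy manifest; the filenames below are illustrative:

```python
import xml.etree.ElementTree as ET

NS = {"opf": "http://www.idpf.org/2007/opf"}

manifest = ET.fromstring(
    '<manifest xmlns="http://www.idpf.org/2007/opf">'
    '<item href="text/chapter-1.xhtml" id="chapter-1" '
    'media-type="application/xhtml+xml"/>'
    '<item href="text/titlepage.xhtml" id="titlepage" '
    'media-type="application/xhtml+xml"/>'
    '<item href="images/cover.svg" id="cover" media-type="image/svg+xml"/>'
    '</manifest>'
)

# Brittle by design: mirrors substring( @id, 0, 8 ) = 'chapter' from the
# stylesheet. A French edition using id="chapitre-1" would slip through.
chapters = [
    item.get("href")
    for item in manifest.findall("opf:item", NS)
    if item.get("media-type") == "application/xhtml+xml"
    and item.get("id", "").startswith("chapter")
]
print(chapters)  # ['text/chapter-1.xhtml']
```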

The second step entails exporting the author and title from each chapter file. Replace the contents of se2md.xsl again:

<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!ENTITY nl "&#10;">
]>
<xsl:stylesheet version="3.0"
  xmlns:opf="http://www.idpf.org/2007/opf"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:h="http://www.w3.org/1999/xhtml"
  xmlns:epub="http://www.idpf.org/2007/ops">

  <xsl:output method="text" encoding="utf-8" />

  <xsl:template match="/ | opf:package | opf:metadata">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="dc:title[@id='title'] | dc:creator[@id='author']">
    <xsl:text>::: </xsl:text>
    <xsl:value-of select="@id" /><xsl:text>&nl;</xsl:text>
    <xsl:apply-templates /><xsl:text>&nl;</xsl:text>
    <xsl:text>:::&nl;&nl;</xsl:text>
  </xsl:template>

  <xsl:template match="opf:manifest">
    <xsl:variable name="book">
      <book>
        <xsl:copy-of select="document(
          opf:item[
            @media-type='application/xhtml+xml' and
            substring( @id, 0, 8 )='chapter']/@href, . )
          /h:html/h:body/h:section" />
      </book>
    </xsl:variable>
    <xsl:apply-templates select="$book" />
  </xsl:template>

  <xsl:template match="book">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="h:section[matches( @epub:type, '.*' )]">
    <xsl:if test="count( preceding::h:section[@id=current()/@id] ) = 0">
      <xsl:for-each select="0 to count( ancestor::h:section )">
        <xsl:text>#</xsl:text>
      </xsl:for-each>
      <xsl:text> </xsl:text>
      <xsl:value-of select="@id" />
      <xsl:text>&nl;&nl;</xsl:text>
    </xsl:if>
    <xsl:apply-templates />
  </xsl:template>

  <xsl:strip-space elements="*" />
  <xsl:template match="*" />
</xsl:stylesheet>

Re-run the XSLT processor to see:

::: title
Pride and Prejudice
:::

::: author
Jane Austen
:::

# chapter-1

# chapter-2

When run against Victor Hugo’s Les Misérables, we see:

::: title
Les Misérables
:::

::: author
Victor Hugo
:::

# volume-1

## book-1-1

### chapter-1-1-1

### chapter-1-1-2

... skipped for brevity...

## book-1-2

... skipped for brevity...

# volume-2

## book-2-1

### chapter-2-1-1

Each XHTML chapter file may repeat the volume and part number. This interferes with our ability to both autogenerate a table of contents and start each volume or part on a new page. Recall that we introduced a book element to nest all the concatenated XHTML document sections together. Let’s look a little closer at how this is leveraged:

<xsl:template match="h:section[matches( @epub:type, '.*' )]">
  <xsl:if test="count( preceding::h:section[@id=current()/@id] ) = 0">
    <xsl:for-each select="0 to count( ancestor::h:section )">
      <xsl:text>#</xsl:text>
    </xsl:for-each>
    <xsl:text> </xsl:text>
    <xsl:value-of select="@id" />
    <xsl:text>&nl;&nl;</xsl:text>
  </xsl:if>
  <xsl:apply-templates />
</xsl:template>

The above template is fairly generic in that it isn’t specific to any one type of section. It handles part , volume , chapter , and any other nesting levels or names that the XHTML throws at it. Upon inspection:

h:section[matches( @epub:type, '.*' )] – Matches any section element (in the XHTML namespace) carrying an epub:type attribute; the regular expression .* matches any attribute value.

count( ... ) = 0 – Guards against redundant section elements.

preceding::h:section[@id=current()/@id] – Collects all previous section elements that have the same id attribute as the current section. For example, if the current id value is volume-1 and a previous section element had an id value of volume-1, then the result from the count function will be greater than zero. Thus repeated sections are skipped.

0 to count( ancestor::h:section ) – Iterates up the nested chain of section elements such that the nesting depth controls the number of # symbols written. If the book has volumes, parts, and chapters, then each chapter will be marked using ###.

Note that the text for each heading is really a placeholder. When styling the chapters using ConTeXt, the text will be rewritten altogether. To reiterate, the choice of how to represent numerals is a presentation decision.

For the third step, we want to convert each XHTML element into its equivalent Markdown. In the interest of brevity, here’s how this is accomplished for a few simple XHTML elements:

<xsl:template match="h:p">
  <xsl:apply-templates />
  <xsl:text>&nl;&nl;</xsl:text>
</xsl:template>

<xsl:template match="h:em | h:i">
  <xsl:text>_</xsl:text>
  <xsl:apply-templates />
  <xsl:text>_</xsl:text>
</xsl:template>

<!-- Bold is swapped for small caps by the typesetting engine. -->
<xsl:template match="h:strong | h:b">
  <xsl:text>**</xsl:text>
  <xsl:apply-templates />
  <xsl:text>**</xsl:text>
</xsl:template>

<xsl:template match="h:abbr | h:span">
  <xsl:apply-templates />
</xsl:template>

And so on. The full conversion is quite long; having explained the high- and many low-level concepts necessary to do the conversion, we’ll forgo delving into the technical minutiae of the stylesheet code for converting the remaining XHTML elements. Go on, thank me for sparing you.

The fourth and final step isn’t immediately obvious and may not be entirely necessary, if you are fine with letting pandoc and ConTeXt figure out how to handle Unicode characters. Otherwise, inject the following code into the stylesheet:

<xsl:output method="text" encoding="utf-8"
  use-character-maps="ununicode" />

<!-- Map specific Unicode characters to Markdown equivalents. -->
<xsl:character-map name="ununicode">
  <!-- hair space -->
  <xsl:output-character character="&#x200a;" string="" />
  <!-- ellipsis -->
  <xsl:output-character character="…" string="..." />
  <!-- en-dash -->
  <xsl:output-character character="–" string="--" />
  <!-- em-dash -->
  <xsl:output-character character="—" string="---" />
  <!-- two em-dash -->
  <xsl:output-character character="⸺" string="--- ---" />
  <!-- three em-dash -->
  <xsl:output-character character="⸻" string="--- --- ---" />
</xsl:character-map>
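The equivalent mapping outside XSLT is a translation table; a Python sketch mirroring the character map above:

```python
# Map typographic Unicode characters to Markdown-friendly ASCII,
# mirroring the xsl:character-map entries.
UNUNICODE = str.maketrans({
    "\u200a": "",             # hair space: removed
    "\u2026": "...",          # ellipsis
    "\u2013": "--",           # en-dash
    "\u2014": "---",          # em-dash
    "\u2e3a": "--- ---",      # two-em dash
    "\u2e3b": "--- --- ---",  # three-em dash
})

print("To be\u2014or not\u2026".translate(UNUNICODE))  # To be---or not...
```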

As a starting point, the downloadable stylesheet (below) transforms many blockquote and div environments, poetry, tables, and more.

The end? Well, almost.

Annotations

The Markdown output contains many blocks that resemble:

::: annotation
Text
:::

These will have to be translated into ConTeXt environments and styled separately. For now, our mission is accomplished: by and large, we have translated classic novels marked up by Standard Ebooks into Markdown.

Download

Download the complete stylesheet and build script, released under an MIT license. Be sure to copy build-template into $HOME/bin for the build script to work.

Summary

In practice, communicating and formalizing a syntax that wholly separates content from presentation is hard. Even when the intent is clear, such as with the HTML Writers Guild and Standard Ebooks, there are a plethora of ways that the two get inseparably mingled. In this review of Project Gutenberg Projects, we encountered the following issues and ideas:

By line – “By” is implied just by a work being authored

Subtitle – “Or” can be made machine-readable

Page numbering – Avoid transcribing numbers altogether

Table of Contents – Machine-generate from chapter titles

Chapter numbers – Machine-generate and avoid styling

Capitalization – Avoid all caps, let typesetting change the case

Paragraphs – One paragraph per element, keep lines together

Quotes – Consider marking up speech using TTS-friendly elements

Spacing – Prefer semantic markup, avoid indenting with spaces

Section breaks – Use semantic markup that a computer can style

Standard Ebooks avoid many pitfalls in their separation of content from presentation.

Email me your suggestions, corrections, or thoughts on this topic.

About the Author

My career has spanned tele- and radio communications, enterprise-level e-commerce solutions, finance, transportation, modernization projects in both health and education, and much more.

Delighted to discuss opportunities to work with revolutionary companies combatting climate change.