One of my long-term projects is cleaning up the Unix manual-page corpus so it will render nicely in HTML.

The world is divided into two kinds of people. One kind hears that, just nods and says “That’s nice,” having no idea what it entails. The other kind sputters coffee onto his or her monitor and says something semantically equivalent to “How the holy jumping fsck do you think you’re ever going to pull that off?”

The second kind has a clue. The Unix man page corpus is scattered across tens of thousands of software projects. It’s written in a markup – troff plus man macros – that is a tag soup notoriously resistant to parsing. The markup is underspecified and poorly documented, so people come up with astoundingly perverse ways of abusing it that just happen to work because of quirks in the major implementation but confuse the crap out of analysis tools. And the markup is quite presentation oriented; much of it is visual rather than structural and thus difficult to translate well to the web – where you don’t even know the “paper” size of your reader’s viewer, let alone what fonts and graphics capabilities it has.

Nevertheless, I’ve been working this problem for seventeen years and believe I’m closing in on success in, maybe, another five or so. In the rest of this post I’ll describe what I’m doing and why, so I have an explanation to point to and don’t have to repeat it.

First, how we got where we are. Unix documentation predates even video terminals. When the first manual page was written in the very early 1970s, the way you displayed stuff was to print it – either on your teletype or – slightly later – a phototypesetter.

Crucially, while the phototypesetter could do fairly high-quality typesetting with multiple fonts and kerning, the teletype was limited to a single fixed-width font. Thus, from nearly the beginning, the Unix documentation toolchain was adapted to two different output modes, one assuming only very limited capability from its output device.

At the center of the “Unix documentation toolchain” were troff (for phototypesetters) and its close variant nroff (for ttys). Both interpreted a common typesetting language. The language is very low-level and visually oriented, with commands like “insert line break” and “change to specified font”. Its distinguishing feature is that (most) troff requests are control words starting with a dot at the beginning of a line; thus, “insert line break” is “.br”. But some requests were “escapes” begun with a backslash and placed inline; thus, “\fI” means “change to italic font”.
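To make the distinction between control-line requests and inline escapes concrete, here is a minimal Python sketch. It is an illustration only, not part of any real troff implementation: the function name `classify_line` is hypothetical, and it recognizes only single-letter font escapes such as \fI, ignoring the many other escape forms troff supports.

```python
import re

# Inline font escapes of the form \fX with a one-character font name,
# e.g. \fI or \fB. Real troff has many more escape forms than this.
FONT_ESCAPE = re.compile(r"\\f([A-Za-z0-9])")


def classify_line(line):
    """Classify one line of troff source.

    Returns ("request", name) for a control line beginning with a dot,
    otherwise ("text", [font names used in inline \\f escapes]).
    Assumption: control lines starting with an apostrophe are ignored.
    """
    if line.startswith("."):
        parts = line[1:].split()
        return ("request", parts[0] if parts else "")
    return ("text", FONT_ESCAPE.findall(line))


if __name__ == "__main__":
    print(classify_line(".br"))
    print(classify_line(r"Run \fIfoo\fR now"))
```

Even this toy version shows why the language is awkward to analyze: structure lives in two different syntaxes that can be mixed freely on adjacent lines.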

Manual pages were never written directly in troff. Instead, they were (and are) written mostly in macros expanded to sequences of troff requests by a preprocessor. Instead of being purely visual, many of these macros are structural; they say things like “start new paragraph” or “item in bulleted list”. I say “mostly” because manual pages still contain low-level requests like font changes.

Text-substitution macro languages have become notorious for encouraging all manner of ingenious but ugly and hard-to-understand hackery. The troff language helped them get that reputation. Users could define their own macros, and sometimes did. The design encouraged visual microtweaking of pages to get the appearance just right – provided you know things like your paper size and the font capabilities of your output device exactly. In the hands of an expert troff could produce spare, elegant typesetting that still looks good decades later.

By 1980 there was already a large corpus – thousands, at least – of manual pages written in troff markup. The way it was rendered was changing, however.

First, ttys were displaced by tube terminals – this was in the late 1970s, around the time I started programming. nroff was quickly adapted to produce output for these, which is why we still use the “man” command in terminal emulators today. That’s nroff behind it turning man-page markup into fixed-width characters on your screen.

Not long after that, people almost completely stopped printing manual pages. The payoff from cute troff tricks declined because tube terminals were such a limited rendering device. This encouraged a change in the way people wrote them – simpler, with less purely visual markup, more structural. Today there’s a noticeable gradient in markup complexity by age of the page – newer ones tend to be simpler and you almost never see the really old-school style of elaborate troff tricks outside of the documentation of GNU troff itself.

Second, in the early 1980s, laser printers and Postscript happened. Unix man pages themselves changed very little in response because nroff-to-terminals had already become so important, but the entire rest of the range of troff’s use cases simplified to “generate Postscript” over the next decade. Occasionally people still ask it to emit HP’s printer language; that’s about the only exception left. The other back-end typesetting languages troff used to emit are all dead.

But the really big disruption was the World Wide Web.

By about 1997 it was becoming obvious that in the future most documentation would move to the web; the advantages of the hyperlink were just too obvious to ignore. The new wave in documentation markup languages, typified by DocBook, was designed for a Web-centric world in which – as with nroff on terminals – your markup can’t safely make a lot of assumptions about display size or fonts.

To deal with this, the new document markup languages were completely structural. But this created a huge problem. How were we going to get the huge pile of grubby, visually-marked-up Unix man pages into purely structural markup?

Yes, you can translate a straight visual markup into a sort of pidgin HTML. That’s what tools like man2html and troff2html do. But this produces poor, ugly HTML that doesn’t exploit the medium well. One major thing you lose is tables. The man pages of these tools are full of caveats and limitations. Basically, they suck.

Trying to jawbone every project maintainer in the world into moving their masters to something else web-friendly by hand seemed doomed. What we really needed was mechanical translation from structural man macros (including table markup) to a structural markup.

When I started thinking about this problem just after Y2K, the general view among experts was that it was impossible, or at least unfeasibly hard barring strong AI. Trying to turn all that messy, frequently malformed visual tag soup into clean structure seemed like a job only a human could handle, involving recognition of high-level patterns and a lot of subtle domain and context knowledge.

Ah, but then there was (in his best Miss Piggy voice) moi.

I have a background in AI and compiler technology. I’m used to the idea that pattern-recognition problems that seem intractable can often be reduced to large collections of chained recognition and production rules. I’ve forgotten more about writing parsers for messy input languages than most programmers ever learn. And I’m not afraid of large problems.

The path forward I chose was to lift manual pages to DocBook-XML, a well-established markup used for long-form technical manuals. “Why that way?” is a reasonable question. The answer is something a few experiments showed me: the indirect path – man markup to DocBook to HTML – produces much better-quality HTML than the rather weak direct-conversion tools.

But lifting to DocBook-XML is a hard problem, because the markup used in man pages has a number of unfortunate properties even beyond those I’ve already mentioned. One is that the native parser for it doesn’t, in general, throw errors on ill-formed or invalid markup. Usually such problems are simply ignored. Sometimes they aren’t but produce defects that are hard for a human reader scanning quickly to notice.

The result is that manual pages often have hidden cruft in them. That is, they may render OK but they do so essentially by accident. Markup malformations that would throw errors in a stricter parser pass unnoticed.

This kind of cruft accumulates as man pages are modified and expanded, like deleterious mutations in a genome. The people who modify them are seldom experts in roff markup; what they tend to do is monkey-copy the usage they see in place, including the mistakes. Thus defect counts tend to be proportional to age and size, with the largest and oldest pages being the cruftiest.

This becomes a real problem when you’re trying to translate the markup to something like DocBook-XML. It’s not enough to be able to lift clean markup that makes structural sense; you have to deal with the accumulated cruft too.

Another big one, of course, is that (as previously noted) roff markup is presentational rather than semantic. Thus, for example, command names are often marked by a font change, but there’s no uniformity about whether the change is to italic, bold, or fixed width.

XML-DocBook wants to do structured tagging based on the intended semantics of text. If you’re starting from presentation markup, you have to back out the intended semantics based on a combination of cliche recognition and context rules. My favorite tutorial example is: string marked by a font change and containing “/” is wrapped by a DocBook filename tag if the name of the enclosing section is “FILES”.
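That tutorial rule is simple enough to sketch in a few lines of Python. What follows is an illustration of the idea, not doclifter’s actual code: the function name `lift_files_line` and the regular expression are assumptions, and the pattern only handles a bold or italic span that is explicitly closed with \fR or \fP.

```python
import re
from xml.sax.saxutils import escape

# A span highlighted by a font change and explicitly closed again,
# e.g. \fI/etc/passwd\fR. Unclosed spans are not handled here.
FONT_SPAN = re.compile(r"\\f[BI](.+?)\\f[RP]")


def lift_files_line(line, section):
    """Wrap font-highlighted strings containing '/' in <filename>,
    but only when the enclosing section is named FILES."""
    def replace(match):
        text = match.group(1)
        if section == "FILES" and "/" in text:
            return "<filename>%s</filename>" % escape(text)
        return match.group(0)  # leave the span for some other rule
    return FONT_SPAN.sub(replace, line)


if __name__ == "__main__":
    print(lift_files_line(r"\fI/etc/passwd\fR holds accounts", "FILES"))
```

Note that the rule needs two pieces of evidence, the cliche (a font change around something with a slash in it) and the context (the section name). Neither alone is enough to justify the semantic tag.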

But different people chose different cliches. Sometimes you get the same cliche used for different semantic purpose by different authors. Sometimes multiple cliches could pattern-match to the same section of text.

A really nasty problem is that roff markup is not consistent (he said, understating wildly) about whether or not its constructions have end-of-scope markers. Sometimes it does – the .RS/.RE macro pair for changing relative indent. More often, as for example in font changes, it doesn’t. It’s common to see markup like “first we’re in \fBbold,\fIthen italic\fR.”

Again, this is a serious difficulty when you’re trying to lift to a more structured XML-based markup with scope enders for everything. Figuring out where the scope ends should go in your translation is far from trivial even for perfectly clean markup.
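For the easiest version of the scope problem, font changes within a single line, a minimal Python sketch looks like this. It is an assumption-laden illustration rather than doclifter’s method: the output tags `<b>` and `<i>` are generic placeholders, not DocBook elements, and \fP (which really means “previous font”) is treated as a plain return to roman.

```python
import re

# Hypothetical mapping from troff font names to placeholder output tags.
FONT_TAGS = {"B": "b", "I": "i"}
FONT_CHANGE = re.compile(r"\\f([BIRP])")


def close_font_scopes(text):
    """Turn unscoped font changes into explicitly closed tag pairs.

    Each font change implicitly ends the scope of the previous one, so
    the open tag is closed before the next one starts. Simplification:
    \\fP is handled like \\fR, i.e. a return to the roman font.
    """
    out = []
    open_tag = None
    pos = 0
    for match in FONT_CHANGE.finditer(text):
        out.append(text[pos:match.start()])
        pos = match.end()
        if open_tag is not None:
            out.append("</%s>" % open_tag)
            open_tag = None
        font = match.group(1)
        if font in FONT_TAGS:
            open_tag = FONT_TAGS[font]
            out.append("<%s>" % open_tag)
    out.append(text[pos:])
    if open_tag is not None:
        out.append("</%s>" % open_tag)  # scope never closed in the source
    return "".join(out)


if __name__ == "__main__":
    print(close_font_scopes(r"first \fBbold,\fIthen italic\fR."))
```

The hard cases are the ones this sketch dodges: font changes that span lines or macro boundaries, and highlights that have to end up inside or outside some enclosing structural element.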

Now think about how all the other problems interact with the cruft. Random undetected cruft can be lying in wait to confuse your cliche recognition and trip up your scope analyzer. In truth, until you start feeling nauseous or terrified you have not grasped the depth of the problem.

The way you tackle this kind of thing is: Bite off a piece you understand by writing a transformation rule for it. Look at the residuals for another pattern that could be antecedent to another transformation. Lather, rinse, repeat. Accept that as the residuals get smaller, they get more irregular and harder to process. You won’t get to perfection, but if you can get to 95% you may be able to declare victory.

A major challenge is keeping the code structure from becoming just as grubby as the pile of transformation rules – because if you let that slide it will become an unmaintainable horror. To achieve that, you have to be constantly looking for opportunities to generalize and make your engine table-driven rather than writing a lot of ad-hoc logic.
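Here is a minimal Python sketch of what “table-driven” means in this setting, together with the residual-collecting loop described above. Everything in it is hypothetical: the two rules, the output templates, and the name `apply_rules` are made up for illustration and bear no relation to doclifter’s real rule set.

```python
import re

# The rule table: (pattern, output template). Adding a transformation
# means adding a row here, not writing another branch of ad-hoc logic.
RULES = [
    (re.compile(r"^\.SH\s+(.+)$"), r"<title>\1</title>"),
    (re.compile(r"^\.B\s+(.+)$"), r"<emphasis>\1</emphasis>"),
]


def apply_rules(lines, rules=RULES):
    """Apply the first matching rule to each line.

    Returns (lifted, residuals). Request lines that no rule matched are
    passed through unchanged and also collected as residuals, so they
    can be inspected for the next pattern worth writing a rule for.
    """
    lifted, residuals = [], []
    for line in lines:
        for pattern, template in rules:
            match = pattern.match(line)
            if match:
                lifted.append(match.expand(template))
                break
        else:
            if line.startswith("."):
                residuals.append(line)
            lifted.append(line)
    return lifted, residuals


if __name__ == "__main__":
    lifted, residuals = apply_rules([".SH NAME", "foo", ".ti 5"])
    print(lifted)
    print(residuals)
```

The residual list is the important output during development: it is what you stare at when deciding which piece to bite off next.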

It took me a year of effort to get to doclifter 1.0. It could do a clean lift on 95% of the 5548 man pages in a full Red Hat 7.3 workstation install to DocBook. (That’s a bit less than half the volume of the man pages on a stock Ubuntu installation in 2018.) The reaction of topic experts at the time was rather incredulous. People who understood the problem had trouble believing doclifter actually worked, and no blame for that – I’m good, but it was not a given that the problem was actually tractable. In truth even I was a little surprised at getting that good a coverage rate without hitting a wall.

Those of you a bit familiar with natural-language processing will be unsurprised to learn that at every iteration 20% of the remaining it-no-work pages gave me 80% of the problems, or that progress slowed in an inverse-geometric way as I got closer to 1.0.

In retrospect I was helped by the great simplification in man markup style that began when tube terminals made nroff the renderer for 99% of all man page views. In effect, this pre-adapted man page markup for the web, tending to select out the most complex and intractable troff features in favor of simple structure that would actually render on a tube terminal.

Just because I could, I also taught doclifter to handle the whole rest of the range of troff markups – ms, mm, me and so forth. This wasn’t actually very difficult once I had the framework code for man processing. I have no real idea how much this capability has actually been used.

With doclifter production-ready I had the tool required to drain the swamp. But that didn’t mean I was done. Oh no. That was the easy part. To get to the point where Linux and *BSD distributions could flip a switch and expect to webify everything I knew I’d have to push the failure rate of automated translation another factor of five lower, to the point where the volume of exceptions could be reasonably handled by humans on tight deadlines.

There were two paths forward to doing that. One was to jawbone project maintainers into moving to new-school, web-friendly master formats like DocBook and asciidoc. Which I did; as a result, the percentage of man pages written that way has gone from about 2% to about 6%.

But I knew most projects wouldn’t move, or wouldn’t move quickly. The alternative was to prod that remnant 5%, one by one, into fixing their crappy markup. Which I have now been doing for fifteen years, since 2003.

Every year or two I do a sweep through every manual page in sight of me, which means everything on a stock install of the currently dominant Linux distro, plus a boatload of additional pages for development tools and other things I use. I run doclifter on every single one, make patches to fix broken or otherwise untranslatable markup, and mail them off to maintainers. You can look at my current patch set and notes here.

I’ve had 579 patches accepted so far, so I am getting cooperation. But the cycle time is slow; there wouldn’t be much point in sampling the corpus faster than the refresh interval of my Linux distribution, which is about six months.

In a typical round, about 80 patches from my previous round have landed and I have to write maybe two dozen new ones. Once I’ve fixed a page it mostly stays fixed. The most common exception to that is people modifying command-option syntax and forgetting to close a “]” group; I catch a lot of those. Botched font changes are also common; it’s easy to write one of those \-escapes incorrectly and not notice it.

There are a few categories of error that, at this point, cause me the most problems. A big one is botched syntax in descriptions of command-line options, the simplest of which is unbalanced [ or ] in option groups. But there are other things that can go wrong; there are people, for example, who don’t know that you’re supposed to wrap mandatory switches and arguments in { } and use something else instead, often plain parentheses. It doesn’t help that there is no formal standard for this syntax, just tradition – but some tools will break if you flout it.
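The simplest of these checks, unbalanced brackets in a synopsis, can be sketched in Python with a stack. This is an illustrative check under the traditional conventions described above, not doclifter’s actual synopsis parser; the function name `check_option_groups` is hypothetical.

```python
# Closing delimiter -> the opening delimiter it must match.
PAIRS = {"]": "[", "}": "{"}


def check_option_groups(synopsis):
    """Return a list of problem descriptions for a command synopsis.

    An empty list means every [ ] optional group and every { }
    mandatory group is properly balanced and nested.
    """
    stack = []      # (opening character, column) for each open group
    problems = []
    for column, ch in enumerate(synopsis):
        if ch in "[{":
            stack.append((ch, column))
        elif ch in PAIRS:
            if stack and stack[-1][0] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append("unexpected %r at column %d" % (ch, column))
    for ch, column in stack:
        problems.append("unclosed %r at column %d" % (ch, column))
    return problems


if __name__ == "__main__":
    print(check_option_groups("foo [-a] {-b | -c} file"))
    print(check_option_groups("foo [-a [-b] file"))
```

A check like this catches the forgotten “]” but says nothing about the parentheses-for-braces habit, which needs a judgment about what the author meant.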

A related one is that some people intersperse explanatory text sections in their command synopses, or follow a command synopsis with a summary paragraph. The proper boundary to such trailing paragraphs is fiendishly difficult to parse because distinguishing fragments of natural language from command syntax is hard, and DocBook markup can’t express the interspersed text at all. This is one of the very few cases in which I have to impose a usage restriction in order to get pages to lift. If you maintain a manual page, don’t do these things! If doclifter tells you “warning – dubious content in Synopsis”, please fix until it doesn’t.

Another bane of my life has been botched list syntax, especially from misuse of the .TP macro. This used to be very common, but I’ve almost succeeded in stamping it out; only around 20 instances turned up in my latest pass. The worst defects come from writing a “bodiless .TP”, an instance with a tag but no following text before another .TP or a section macro. This is the most common cause of pages that lift to ill-formed XML, and it can’t be fixed by trickier parsing. Believe me, I’ve tried…
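If you maintain pages and want to look for this defect yourself, a rough Python sketch of a detector follows. It is a simplified illustration, not the check doclifter performs: the name `find_bodiless_tp` is hypothetical, it assumes the line after .TP is always the tag, and it counts any non-blank, non-comment line as body text.

```python
# Requests that end the scope of a .TP item in this simplified model.
TERMINATORS = (".TP", ".SH", ".SS")


def request_name(line):
    """Return the request or macro name of a control line, else ''."""
    return line.split()[0] if line.startswith(".") else ""


def find_bodiless_tp(lines):
    """Return 1-based line numbers of .TP requests whose tag is not
    followed by any body text before the next .TP or section macro."""
    hits = []
    for i, line in enumerate(lines):
        if request_name(line) != ".TP":
            continue
        body_found = False
        for later in lines[i + 2:]:        # lines[i + 1] is the tag
            if request_name(later) in TERMINATORS:
                break
            if later.strip() and not later.startswith('.\\"'):
                body_found = True
                break
        if not body_found:
            hits.append(i + 1)
    return hits


if __name__ == "__main__":
    page = [".TP", r"\fB-a\fR", ".TP", r"\fB-b\fR", "Do b things.", ".SH FILES"]
    print(find_bodiless_tp(page))
```

The usual intent behind a bodiless .TP is two tags sharing one description; that intent is visible to a human but not recoverable mechanically, which is why the fix has to happen in the page source.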

Another big category of problems is people using very low-level troff requests that can’t be parsed into structure, like .in and .ce and especially .ti requests. And yet another is abuse of the .SS macro to bolden and outdent text that isn’t really a section heading.

But over time I have actually been succeeding in banishing a lot of this crap. Counting pages that have moved to web-friendly master formats, the percentage of man-page content that can go automatically to really high-quality HTML (with tables and figures and formulas properly carried along) is now over 99%.

And yes, I do think what I see in a mainstream Linux distro is a sufficiently large and representative sample for me to say that with confidence. Because I notice that my remaining 75 or so awkward cases are now heavily concentrated around a handful of crotchety GNU projects; groff itself being a big one.

I’ll probably never get it perfect. Some fraction of manual pages will always be malformed enough to terminally confuse my parser. Strengthening doclifter enough to not barf on more of them follows a pattern I’ve called a “Zeno tarpit” – geometrically increasing effort for geometrically decreasing returns.

Even if I could bulletproof the parser, perfection on the output side is hard to even define. It depends on your metric of quality and how many different rendering targets you really want to support. There are three major possibles: HTML, PostScript, and plain text. DocBook can render to any of them.

There’s an inescapable tradeoff where if you optimize for one rendering target you degrade rendered quality for the others. Man-page markup is an amphibian – part structural, part visual. If you use it visually and tune it carefully you will indeed get the best output possible on any individual rendering target, but the others will look pretty terrible.

This is not a new problem. You could always get especially pretty typography in man if you decide you care most about the Postscript target and use troff as a purely visual markup, because that’s what it was designed for. But you take a serious hit in quality of generated HTML because in order to do *that* right you need *structural* markup and a good stylesheet. You take a lesser hit in plain-text rendering, especially near figures and tables.

On the other hand, if you avoid purely visual markup (like .br, .ce, \h and \v), emphasizing structure tags over motions, you can make a roff master that will render good HTML, good plain text, and something acceptable if mediocre on Postscript/PDF. But you get the best results not by naive translation but by running a cliche recognizer on the markup and lifting it to structure, then rendering that to whatever you want via stylesheets. That’s what doclifter does.

Lifting to DocBook makes sense under the assumption that you want to optimize the output to HTML. This pessimizes the Postscript output, but the situation is not quite symmetrical among the three major targets; what’s good for HTML tends to be coupled to what’s good for plain text. My big cleanup of the man page corpus, now more than 97% complete after seventeen years of plugging at it, is based on the assumption that Postscript is no longer an important target, because who prints manual pages anymore?

Thus, the changes I ship upstream try to move everyone towards structural markup, which (a) is less fragile, (b) is less likely to break on alien viewers, and (c) renders better to HTML.