Memories of writing a parser for man pages March 24th, 2018

I generally enjoy being bored, but sometimes enough is enough—that was the case a Sunday afternoon of 2015 when I decided to start an open source project to overcome my boredom.

In my quest for ideas, I stumbled upon a request to build a “Man page viewer built with web standards” by Mathias Bynens and without thinking too much, I started coding a man page parser in JavaScript, which after a lot of back and forths, ended up being Jroff.

Back then, I was familiar with manual pages as a concept and used them a fair amount of times, but that was all I knew, I had no idea how they were generated or if there was a standard in place. Two years later, here are some thoughts on the matter.

The first thing that surprised me at the time, was the notion that manpages at their core are just plain text files stored somewhere in the system (you can check this directory using the manpath command).

This files not only contain the documentation, but also formatting information using a typesetting system from the 1970s called troff .

troff, and its GNU implementation groff, are programs that process a textual description of a document to produce typeset versions suitable for printing. It’s more ‘What you describe is what you get’ rather than WYSIWYG. — extracted from troff.org

If you are totally unfamiliar with typesetting formats, you can think of them as Markdown on steroids, but in exchange for the flexibility you have a more complex syntax:

The groff file can be written manually, or generated from other formats such as Markdown, Latex, HTML, and so on with many different tools.

Why groff and man pages are tied together has to do with history, the format has mutated along time, and his lineage is composed of a chain of similarly-named programs: RUNOFF > roff > nroff > troff > groff.

But this doesn’t necessarily mean that groff is strictly related to man pages, it’s a general-purpose format that has been used to write books and even for phototypesetting.

Moreover, It’s worth noting that groff can also call a postprocessor to convert its intermediate output to a final format, which is not necessarily ascii for terminal display! some of the supported formats are: TeX DVI, HTML, Canon, HP LaserJet4 compatible, PostScript, utf8 and many more.

Other of the cool features of the format is its extensibility, you can write macros that enhance the basic functionalities.

With the vast history of *nix systems, there are several macro packages that group useful macros together for specific functionalities according to the output that you want to generate, examples of macro packages are man , mdoc , mom , ms , mm , and the list goes on.

Manual pages are conventionally written using man and mdoc .

You can easily distinguish native groff commands from macros by the way standard groff packages capitalize their macro names. For man , each macro’s name is uppercased, like .PP, .TH, .SH, etc. For mdoc , only the first letter is uppercased: .Pp, .Dt, .Sh.

Whether you are considering to write your own groff parser, or just curious, these are some of the problems that I have found more challenging.

Formally, groff has a context-free grammar, unfortunately, since macros describe opaque bodies of tokens, the set of macros in a package may not itself implement a context-free grammar.

This kept me away (for good or bad) from the parser generators that were available at the time.

Most of the macros in mdoc are callable, this roughly means that macros can be used as arguments of other macros, for example, consider this:

The macro Fl (Flag) adds a dash to its argument, so Fl s produces -s

(Flag) adds a dash to its argument, so produces The macro Ar (Argument) provides facilities to define arguments

(Argument) provides facilities to define arguments The Op (Optional) macro wraps its argument in brackets, as this is the standard idiom to define something as optional.

(Optional) macro wraps its argument in brackets, as this is the standard idiom to define something as optional. The following combination .Op Fl s Ar file produces [-s file] because Op macros can be nested.

Something that really confused me was the lack of a canonical, well defined and clear source to look at, there’s a lot of information in the web which assumes a lot about the reader that it takes time to grasp.

To wrap up, I will offer to you a very short list of macros that I found interesting while developing jroff:

man

TH: when writing manual pages with man macros, your first line that is not a comment must be this macro, it accepts five parameters: title section date source manual

macros, your first line that is not a comment must be this macro, it accepts five parameters: title section date source manual BI: bold alternating with italics (especially useful for function specifications)

BR: bold alternating with Roman (especially useful for referring to other manual pages)

mdoc

.Dd, .Dt, .Os: similar to how man macros require the .TH the mdoc macros require these three macros, in that particular order. Their initials stand for: Document date, Document title and Operating system.

macros require the the macros require these three macros, in that particular order. Their initials stand for: Document date, Document title and Operating system. .Bl, .It, .El: these three macros are used to create list, their names are self-explanatory: Begin list, Item and End list.