markatu

Oct 11, 2018

Inventing a lightweight markup language.

So, I started writing this blog using markdown. But, I soon found that markdown wasn't able to generate the kind of HTML that I wanted. In this article, I talk about the techniques I used to invent my own lightweight markup language. I took inspiration from markdown's brevity and slim's flexibility, and threw in some constructs from high-level programming languages.

If you're more interested in the final product than the journey, you can check out the final git repository, which has a command line tool for turning things like this into HTML:

    h2#title: markatu
    small.w3-right: Oct 11, 2018
    h3#subtitle: Inventing a lightweight markup language.

    So, I started writing this blog using markdown. But, I soon found that markdown wasn't able to generate the kind of HTML that I wanted. In this article, I talk about the techniques I used to invent my own lightweight markup language. I took inspiration from markdown's brevity and <slim:slim-lang.org>'s flexibility, and threw in some constructs from high-level programming languages.

    If you're more interested in the final product than the journey, you can check out the final git <repository:https://github.com/bduggan/markatu>, which has a command line tool for turning things like this into HTML:

    example=div.w3-panel,w3-card,w3-light-grey,w3-code {
    +INCLUDE index.mt 1-25
    }

    Some features of the final language:

Some features of the final language:

Uses punctuation for things like bold, bullets and inline code (like markdown).

Can generate arbitrary nested tags with attributes, including ids and classes (like slim).

Uses blank lines to separate paragraphs (like markdown).

Supports aliases (like example above).

Supports including other files, as well as running them and capturing their output.

Anyway, here are the techniques I used to make a parser and generate HTML. By the way, if you like examples instead -- the source code for this blog entry is at the bottom of this page.

Let's start with paragraphs: blocks of text separated by blank lines. The grammar below parses paragraphs. The % is a shortcut for "separated by". So, % "\n\n" matches paragraphs which are separated by two newlines in a row. Similarly, a paragraph is a sequence of lines separated by single newlines. \N matches anything except a \n. Note that we have a regex, a rule, and a token. A token is a regex without backtracking (like a lexer). A rule is a token, but spaces in the rule match whitespace in the input.

    grammar Markatu::Grammar {
      rule TOP { <p>+ % "\n\n" }
      regex p { <line>+ % "\n" }
      token line { \N+ }
    }

    say Markatu::Grammar.parse: q:to/END/;
    A paragraph.

    A paragraph that
    has two lines.
    END

When we print the value returned by parse using say, we get a nice little tree of matches. Here's the output:

    ｢A paragraph.

    A paragraph that
    has two lines.
    ｣
     p => ｢A paragraph.｣
      line => ｢A paragraph.｣
     p => ｢A paragraph that
    has two lines.｣
      line => ｢A paragraph that｣
      line => ｢has two lines.｣

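To make the % separator and the regex/token/rule distinction concrete, here is a small standalone sketch. It is not part of markatu; the names backtracks, ratchets, tight and loose are made up for illustration.

```raku
# % means "separated by": digit runs with commas between them.
# A trailing separator is not allowed.
say so '1,22,333' ~~ / ^ [ \d+ ]+ % ',' $ /;   # True
say so '1,22,'    ~~ / ^ [ \d+ ]+ % ',' $ /;   # False

# A regex may backtrack; a token may not.
# \w+ first swallows all of "aab", so the token can never give the 'b' back.
my regex backtracks { \w+ 'b' }
my token ratchets   { \w+ 'b' }
say so 'aab' ~~ / ^ <backtracks> $ /;   # True
say so 'aab' ~~ / ^ <ratchets> $ /;     # False

# A rule is a token in which whitespace in the pattern matches
# whitespace in the input.
my token tight { 'a' 'b' }
my rule  loose { 'a' 'b' }
say so 'a b' ~~ / ^ <tight> $ /;        # False
say so 'a b' ~~ / ^ <loose> $ /;        # True
```
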
You are probably saying, okay, I could have just called split("\n\n") to get all the paragraphs, and you are right, but stick with me, it gets better.

We have a tree, but we want HTML, so let's make a quick Node class to represent a DOM node. A node has a tag, maybe some attributes (a hash), maybe some text (a scalar), and maybe some children (an array). Rendering is recursive. We could use typing and declare the types of things too (e.g. all the children are Nodes) but for now I want to be lazy and quick. And anyhow we use sigils to at least indicate the container type of the attributes: $ is a scalar, % is a Hash of attributes, and @ makes an Array of children. By the way, we make the tag optional, so that we can have elements of the DOM tree that just group other elements together.

    class Node {
      has $.tag;
      has $.text = '';
      has %.attrs;
      has @.children;
      method open-tag {
        return "" unless $.tag;
        "<$.tag"
        ~ ( %.attrs.kv.map: { qq[ $^key="$^value"] } ).join
        ~ ">"
      }
      method close-tag {
        return "" unless $.tag;
        "</$.tag>\n"
      }
      method render {
        self.open-tag
        ~ @.children.map({ .render // '' }).join
        ~ $.text
        ~ self.close-tag;
      }
    }

    say Node.new(
      :tag<div>,
      children => (
        Node.new(:tag<p>, :text<hello>),
        Node.new(:tag<pre>, :text<world>, :attrs( %(id => 'earth') ))
      )
    ).render;

Okay, here's the output of the code above.

    <div><p>hello</p>
    <pre id="earth">world</pre>
    </div>

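For the curious, here is roughly what the typed version mentioned above could look like. This is only a sketch: TypedNode is a made-up name, and the rendering methods are left out to keep it short.

```raku
# A typed variant of the Node attributes (TypedNode is a hypothetical name).
class TypedNode {
    has Str       $.tag;
    has Str       $.text = '';
    has Str       %.attrs;
    has TypedNode @.children;
}

my $tree = TypedNode.new(
    :tag<div>,
    children => (TypedNode.new(:tag<p>, :text<hello>),),
);
say $tree.children[0].text;   # hello

# A child that is not a TypedNode fails the type check at construction time.
my $bad = try TypedNode.new(:tag<div>, children => ('just a string',));
say $bad.defined;             # False
```

The trade-off is the usual one: a little more typing up front, in exchange for errors that show up when the tree is built rather than when it is rendered.
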
Let's put these two together and generate some HTML.

To generate something, we make an "actions" class -- a class whose methods have the same names as the rules in the grammar. When the grammar matches a rule against some portion of the input, the corresponding method in the actions class is called. The argument that comes in, $/, is the match object -- which references the current text that was matched. It has a little stash that can be accessed by calling .make (to set a value) or .made (to get a value). In our case we are making a DOM tree, so we will be sending Node objects to .make and retrieving them with .made.

    class Markatu::Actions {
      method TOP($/) {
        $/.make: Node.new: children => $<p>.map: *.made
      }
      method p($/) {
        $/.make: Node.new: :tag<p>,
          :text($<line>.map(*.made).join("\n"))
      }
      method line($/) {
        $/.make: "$/"
      }
    }

    my $actions = Markatu::Actions.new;
    my $match = Markatu::Grammar.parse: q:to/END/, :$actions;
    A paragraph.

    A paragraph that
    has two lines.
    END
    say $match.made.render

Again, the output follows the code:

    <p>A paragraph.</p>
    <p>A paragraph that
    has two lines.</p>

So, okay, maybe that was more work than writing HTML. But the input was this:

    A paragraph.

    A paragraph that
    has two lines.

And now we can have some fun and make our language a bit better.

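If make and made are new to you, a toy example may help show how values travel up the tree. The Sum grammar and SumActions class below are invented for this illustration and have nothing to do with markatu.

```raku
grammar Sum {
    token TOP { <num>+ % '+' }
    token num { \d+ }
}

class SumActions {
    # Each num stashes its numeric value...
    method num($/) { make +$/ }
    # ...and TOP collects the stashed values from its children.
    method TOP($/) { make $<num>.map(*.made).sum }
}

say Sum.parse('1+20+300', actions => SumActions.new).made;   # 321
```

The Node-building actions follow the same pattern, only the stashed values are Node objects instead of numbers.
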
First let's do some basics like bold or monospace.

We break up our line into phrases, break up our phrases into characters.

    grammar G {
      token line { <phrase>+ }
      token phrase { <bold> | <code> | <-[`*]>+ }
      token bold { '*' <( <-[*]>+ )> '*' }
      token code { '`' <( <-[`]>+ )> '`' }
    }

    say G.parse: :rule<line>, 'Some *bold*, some `code`.';
    say G.parse: :rule<line>, 'Some `code with a * in it`.';

Here's the output:

    ｢Some *bold*, some `code`.｣
     phrase => ｢Some ｣
     phrase => ｢*bold*｣
      bold => ｢bold｣
     phrase => ｢, some ｣
     phrase => ｢`code`｣
      code => ｢code｣
     phrase => ｢.｣
    ｢Some `code with a * in it`.｣
     phrase => ｢Some ｣
     phrase => ｢`code with a * in it`｣
      code => ｢code with a * in it｣
     phrase => ｢.｣

And use these new definitions to construct DOM nodes.
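One possible shape for those actions is sketched below. It is not the code from the markatu repository: the trimmed-down Node class, the GActions name, and the choice of b and code as output tags are all assumptions made for this example.

```raku
# Trimmed stand-in for the Node class: just a tag and some text.
class Node {
    has $.tag;
    has $.text = '';
    method render {
        $.tag ?? '<' ~ $.tag ~ '>' ~ $.text ~ '</' ~ $.tag ~ '>' !! $.text
    }
}

grammar G {
    token line   { <phrase>+ }
    token phrase { <bold> | <code> | <-[`*]>+ }
    token bold   { '*' <( <-[*]>+ )> '*' }
    token code   { '`' <( <-[`]>+ )> '`' }
}

class GActions {
    # <( and )> trim the delimiters, so ~$/ is only the inner text.
    method bold($/) { make Node.new(:tag<b>,    :text(~$/)) }
    method code($/) { make Node.new(:tag<code>, :text(~$/)) }
    # A phrase is a bold node, a code node, or a tagless text node.
    method phrase($/) {
        make $<bold> ?? $<bold>.made
          !! $<code> ?? $<code>.made
          !!            Node.new(:text(~$/));
    }
    # For brevity, a line is made into rendered HTML rather than a Node.
    method line($/) { make $<phrase>.map(*.made.render).join }
}

say G.parse('Some *bold*, some `code`.',
            :rule<line>, :actions(GActions.new)).made;
# Some <b>bold</b>, some <code>code</code>.
```
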

I'm going to skip some of the boring stuff of building new nodes, etc., and instead fast forward to an interesting part: making nested tags with lists of classes and ids, and assigning them to aliases.

Here's the source.

    h1.title: An h1 whose class is "title".

    div.w3-col,s6 {
    Inside a div with two classes.

    Still inside a div.
    }

    half=div.w3-col,s6 {
    I am tired of typing names of classes. Let's make "half" an alias.
    }

    half {
    This is the same as `div.w3-col,s6`.
    }

Here's the rendered HTML.

    <h1 class="title">
    An h1 whose class is "title".
    </h1>
    <div class="w3-col s6">
    <p>
    Inside a div with two classes.
    </p>
    <p>
    Still inside a div.
    </p>
    </div>
    <div class="w3-col s6">
    <p>
    I am tired of typing names of classes. Let's make "half" an alias.
    </p>
    </div>
    <div class="w3-col s6">
    <p>
    This is the same as <code>div.w3-col,s6</code>.
    </p>
    </div>

And here's the relevant part of the parser.

    token label {
      [$<declare-variable>=\w+ '=']?
      $<tag>=[\w+]
      ['#' $<id>=\w+ ]?
      ['.' <class-list>]?
    }
    rule tag {
      <label>
      [
        | ':' $<text>=\V*
        | '{' "\n"?
            [ <blocks> "\n"? ]+ % "\n"
          '}'
      ]
    }

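To give a feel for how a label might turn into an opening tag, here is a rough standalone sketch. The label token is the one shown above; everything else is an assumption for illustration: the class-list definition (which the excerpt does not show), the %aliases table, and the open-tag helper. The real implementation in the repository may well differ.

```raku
grammar Label {
    token TOP   { <label> }
    token label {
        [$<declare-variable>=\w+ '=']?
        $<tag>=[\w+]
        ['#' $<id>=\w+ ]?
        ['.' <class-list>]?
    }
    # Assumption: comma-separated class names that may contain hyphens.
    token class-list { <[\w\-]>+ % ',' }
}

my %aliases;   # hypothetical table: alias name => %(tag, classes)

sub open-tag(Str $source) {
    my $match = Label.parse($source) or return;
    my $label = $match<label>;
    my $tag     = ~$label<tag>;
    my @classes = $label<class-list>
        ?? (~$label<class-list>).split(',')
        !! ();

    # A bare alias such as "half" expands to what it stands for.
    if %aliases{$tag} -> %alias {
        @classes = %alias<classes>.list;
        $tag     = %alias<tag>;
    }
    # "name=..." records the alias for later use.
    if $label<declare-variable> -> $name {
        %aliases{~$name} = %( tag => $tag, classes => [@classes] );
    }

    my $attrs = '';
    $attrs ~= ' id="' ~ $label<id> ~ '"' if $label<id>;
    $attrs ~= ' class="' ~ @classes.join(' ') ~ '"' if @classes;
    '<' ~ $tag ~ $attrs ~ '>';
}

say open-tag($_) for 'h2#title', 'half=div.w3-col,s6', 'half';
# <h2 id="title">
# <div class="w3-col s6">
# <div class="w3-col s6">
```
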
Well, for more details, head on over to the GitHub repository. There you can find the source, a test suite with lots of examples, as well as mt -- a command line tool for converting files from markatu into HTML.

Conclusions

Parsing and inventing languages can be fun.

Perl 6 Grammars provide nice building blocks for experimenting with languages.

Lightweight markup languages can be programmer-friendly.

Here is the source for this blog entry: