Rust macros in html5ever

Keegan McAllister

November 6, 2014

Available at kmcallister.github.io

Servo is an experimental browser engine from Mozilla Research

Developed by a dozen Mozilla employees + hundreds of others

Layout code is all-new and written in Rust

2013-09: passed Acid1 test

2014-01: parallel layout (!!!)

2014-05: passed Acid2 test

2014-08: vertical writing (preliminary)

2014-10: incremental layout

C libs replaced in Rust:

2013-10: CSS parsing

2014-07: string interning

2014-10: HTML parsing

2014-11: OpenGL windowing

Faster, cleaner, safer, more correct

Future: JS engine, rasterizer?

What is the HTML syntax? Depends who you ask!

W3C spec: 8 pages

WHATWG spec: 114 pages

Which one is relevant for real browsers and content?

A start tag whose tag name is "nobr"

    Reconstruct the active formatting elements, if any. If the stack of open elements has a nobr element in scope, then this is a parse error; run the adoption agency algorithm for the tag name "nobr", then once again reconstruct the active formatting elements, if any. Insert an HTML element for the token. Push onto the list of active formatting elements that element.

When the steps below require the UA to generate implied end tags, then, while the current node is a dd element, a dt element, an li element, …

When the steps below require the UA to generate all implied end tags thoroughly, then, while the current node is a caption element, a colgroup element, a dd element, …

The upside: Any crap HTML (even 1996 GeoCities) will parse the same in every modern browser

<kmc> should I be scared when the WHATWG spec says "for historical reasons"? because I feel like that phrase already applies to the entire document

<Ms2ger> Correct

<Ms2ger> That just means "for historical reasons we dislike particularly"

html5ever is Servo's new HTML parser, written mostly by me over the course of about 7 months

We now have 8 contributors and several users!

Fast, safe, generic, native UTF-8

Rust and C APIs available

Factor the problem into:

Small amount of difficult macro code

Large amount of mindlessly transcribed rules

Bonus: code looks like the spec!

12.2.4.1 Data state

Consume the next input character:

    U+0026 AMPERSAND (&): Switch to the character reference in data state.
    U+003C LESS-THAN SIGN (<): Switch to the tag open state.
    U+0000 NULL: Parse error. Emit the current input character as a character token.
    EOF: Emit an end-of-file token.
    Anything else: Emit the current input character as a character token.

    match self.state {
        states::Data => loop { match get_char!(self) {
            '&'  => go!(self: consume_char_ref),
            '<'  => go!(self: to TagOpen),
            '\0' => go!(self: error; emit '\0'),
            c    => go!(self: emit c),
        }},

In Hubbub this is about 150 lines of C.

In html5lib it's 20 lines of Python.

    fn feed(&mut self, input: String);
    fn get_char(&mut self) -> Option<char>;
    fn step(&mut self) -> bool;

    fn run(&mut self) {
        while self.step() {
        }
    }

    macro_rules! unwrap_or_return (
        ($opt:expr, $retval:expr) => (
            match $opt {
                None => return $retval,
                Some(x) => x,
            }
        )
    )

    macro_rules! get_char (
        ($me:expr) => (
            unwrap_or_return!($me.get_char(), false)
        )
    )
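
The same early-return trick still works with today's macro_rules! syntax. A minimal sketch, assuming a hypothetical `Input` type in place of the real tokenizer (`Input`, `consume_all` and the `seen` buffer are illustrative names, not html5ever's API):

```rust
use std::collections::VecDeque;

/// Unwrap an `Option`, or return `$retval` from the *enclosing function*.
macro_rules! unwrap_or_return {
    ($opt:expr, $retval:expr) => {
        match $opt {
            None => return $retval,
            Some(x) => x,
        }
    };
}

/// Hypothetical stand-in for the tokenizer: a queue of pending characters.
struct Input {
    chars: VecDeque<char>,
}

impl Input {
    fn new(s: &str) -> Input {
        Input { chars: s.chars().collect() }
    }

    fn get_char(&mut self) -> Option<char> {
        self.chars.pop_front()
    }

    /// Consume one character; return `false` once the input has run out.
    fn step(&mut self, seen: &mut String) -> bool {
        // Expands to a `match` whose `None` arm returns from `step` itself.
        let c = unwrap_or_return!(self.get_char(), false);
        seen.push(c);
        true
    }
}

/// Drive `step` until it reports that no input is left.
fn consume_all(s: &str) -> String {
    let mut input = Input::new(s);
    let mut seen = String::new();
    while input.step(&mut seen) {}
    seen
}
```

The `return` lives inside the macro body, so it leaves whichever function invoked the macro. That is what lets every state bail out with a single token when input runs dry.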

    macro_rules! shorthand (
        ($me:expr : emit $c:expr) => ( $me.emit_char($c); );
    )

Allows for compact sequencing:

    go!(self: error; create_doctype; force_quirks; emit_doctype; to Data)

go! is used over 200 times.

A pattern like $($cmd:tt)* ; $($rest:tt)* is ambiguous :(

    macro_rules! go (
        ($me:expr : $a:tt ; $($rest:tt)*) => ({
            shorthand!($me: $a);
            go!($me: $($rest)*);
        });

        ($me:expr : $a:tt $b:tt ; $($rest:tt)*) => ({
            shorthand!($me: $a $b);
            go!($me: $($rest)*);
        });

        ($me:expr : $a:tt $b:tt $c:tt ; $($rest:tt)*) => ({
            shorthand!($me: $a $b $c);
            go!($me: $($rest)*);
        });

        ($me:expr : to $s:ident) => ({
            $me.state = states::$s;
            return true;
        });

        ($me:expr : $($cmd:tt)+) => ( shorthand!($me: $($cmd)+); );

        ($me:expr : ) => (());
    )
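
For readers on current Rust, here is a cut-down sketch of the same idea. It is not the html5ever source: `Tokenizer`, `Token`, `State` and `tokenize` are hypothetical, only two commands (`emit`, `error`) exist, the `&` and EOF branches are left out, and TagOpen is a stub that swallows one character. It also assumes `$me:ident` in place of `$me:expr`, since current macro_rules! does not allow `:` right after an `expr` fragment.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Clone, Copy)]
enum State { Data, TagOpen }

#[derive(Debug, PartialEq)]
enum Token { Char(char), Error }

struct Tokenizer {
    state: State,
    input: VecDeque<char>,
    out: Vec<Token>,
}

impl Tokenizer {
    fn get_char(&mut self) -> Option<char> { self.input.pop_front() }
    fn emit_char(&mut self, c: char) { self.out.push(Token::Char(c)); }
    fn parse_error(&mut self) { self.out.push(Token::Error); }
}

/// Next character, or `return false` from the calling method.
macro_rules! get_char {
    ($me:ident) => {
        match $me.get_char() {
            None => return false,
            Some(c) => c,
        }
    };
}

/// One command, written the way the spec phrases it.
macro_rules! shorthand {
    ($me:ident : emit $c:expr) => ($me.emit_char($c));
    ($me:ident : error) => ($me.parse_error());
}

/// A `;`-separated sequence of commands, optionally ending in `to State`.
macro_rules! go {
    // One-token command, then more commands.
    ($me:ident : $a:tt ; $($rest:tt)*) => ({ shorthand!($me: $a); go!($me: $($rest)*); });
    // Two-token command, then more commands.
    ($me:ident : $a:tt $b:tt ; $($rest:tt)*) => ({ shorthand!($me: $a $b); go!($me: $($rest)*); });
    // Terminal: switch state and hand control back to the driver loop.
    ($me:ident : to $s:ident) => ({ $me.state = State::$s; return true; });
    // Terminal: a single command with no trailing `;`.
    ($me:ident : $($cmd:tt)+) => ({ shorthand!($me: $($cmd)+); });
    ($me:ident : ) => (());
}

impl Tokenizer {
    /// Run the current state; `false` means the input is exhausted.
    fn step(&mut self) -> bool {
        match self.state {
            State::Data => loop {
                match get_char!(self) {
                    '<' => go!(self: to TagOpen),
                    '\0' => go!(self: error; emit '\0'),
                    c => go!(self: emit c),
                }
            },
            // Stub: skip one character and go back to Data.
            State::TagOpen => {
                let _skipped = get_char!(self);
                go!(self: to Data)
            }
        }
    }
}

fn tokenize(s: &str) -> Vec<Token> {
    let mut t = Tokenizer { state: State::Data, input: s.chars().collect(), out: Vec::new() };
    while t.step() {}
    t.out
}
```

Rules are tried top to bottom, so splitting on the first `;` by counting one or two leading token trees sidesteps the ambiguity of a `$($cmd:tt)* ; $($rest:tt)*` pattern.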

We're already stretching the limits of macro_rules! and we haven't touched tree construction…

Procedural macros run arbitrary Rust code at compile time, using rustc's plugin infrastructure

See doc.rust-lang.org/guide-plugin.html

&lt; parses as "<"

&conint; parses as "∮"

WHATWG publishes about 2,000 of these as JSON

    pub static NAMED_ENTITIES: PhfMap<&'static str, [u32, ..2]>
        = named_entities!("data/entities.json");

    let map: HashMap<String, [u32, ..2]> = ...;

    let toks: Vec<_> = map.into_iter().flat_map(
        |(k, [c0, c1])| {
            let k = k.as_slice();
            (quote_tokens!(&mut *cx,
                $k => [$c0, $c1],
            )).into_iter()
        }
    ).collect();

    MacExpr::new(quote_expr!(&mut *cx,
        phf_map!($toks)
    ))

We use another procedural macro, from sfackler's rust-phf library, to generate a perfect hash table at compile time.

    phf_map!(k => v, k => v, ...)
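
What the table is for can be sketched without any plugin. The version below swaps the perfect hash for a sorted static slice and binary search, so it only illustrates the lookup, not rust-phf. The key format (no leading `&`, trailing `;`), the use of 0 for an unused second slot, and the name `lookup` are assumptions; the five entries are a hand-picked subset of the WHATWG list.

```rust
/// Entity name -> up to two code points. Sorted by key (byte order)
/// so that binary search works. Assumption: 0 marks an unused slot.
static NAMED_ENTITIES: &[(&str, [u32; 2])] = &[
    ("NotEqualTilde;", [0x2242, 0x0338]),
    ("amp;", [0x26, 0]),
    ("conint;", [0x222E, 0]),
    ("gt;", [0x3E, 0]),
    ("lt;", [0x3C, 0]),
];

/// Expand an entity name to its replacement text, if it is in the table.
fn lookup(name: &str) -> Option<String> {
    let idx = NAMED_ENTITIES
        .binary_search_by(|&(key, _)| key.cmp(name))
        .ok()?;
    let [c0, c1] = NAMED_ENTITIES[idx].1;

    let mut out = String::new();
    out.push(char::from_u32(c0)?);
    if c1 != 0 {
        // A few entities expand to two code points, hence the pair.
        out.push(char::from_u32(c1)?);
    }
    Some(out)
}
```

A perfect hash built at compile time gives the same read-only table with a constant-time lookup and no startup cost, which is the reason for generating it with a macro.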

Tree builder has its own rules, less regular in form than the tokenizer.

Instead of match + go!, we'll need a procedural macro.

    match mode {
        InBody => match_token!(token {
            tag @ </a> </b> </big> </code> </em> </font> </i>
                  </nobr> </s> </small> </strike> </strong>
                  </tt> </u> => {
                self.adoption_agency(tag.name);
                Done
            }

            tag @ <h1> <h2> <h3> <h4> <h5> <h6> => {
                self.close_p_element_in_button_scope();
                if self.current_node_in(heading_tag) {

    struct Tag {
        kind: TagKind,
        name: Option<TagName>,
    }

    enum LHS {
        Pat(P<ast::Pat>),
        Tags(Vec<Spanned<Tag>>),
    }

In syntax::codemap you will find

    pub struct Span {
        pub lo: BytePos,
        pub hi: BytePos,
        pub expn_info: Option<P<ExpnInfo>>
    }

    pub struct Spanned<T> {
        pub node: T,
        pub span: Span,
    }

    use syntax::codemap::{Span, Spanned, spanned};
    use syntax::parse::parser::Parser;

    fn parse_spanned_ident(parser: &mut Parser) -> Spanned<Ident> {
        let lo = parser.span.lo;
        let ident = parser.parse_ident();
        let hi = parser.last_span.hi;
        spanned(lo, hi, ident)
    }
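
The idea carries over to any parser: note the position before and after a node, and keep both with the result. A self-contained sketch with hypothetical `Span`, `Spanned` and `Parser` types that use plain byte offsets (these are stand-ins, not the `syntax` crate's types):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    lo: usize, // byte offset of the first character
    hi: usize, // byte offset one past the last character
}

#[derive(Debug, PartialEq)]
struct Spanned<T> {
    node: T,
    span: Span,
}

/// Hypothetical parser over a source string; `pos` is a byte offset.
struct Parser<'a> {
    src: &'a str,
    pos: usize,
}

impl<'a> Parser<'a> {
    fn skip_spaces(&mut self) {
        while self.src[self.pos..].starts_with(' ') {
            self.pos += 1;
        }
    }

    /// Parse a run of ASCII alphanumerics or `_`, remembering where it was.
    fn parse_spanned_ident(&mut self) -> Option<Spanned<&'a str>> {
        self.skip_spaces();
        let src: &'a str = self.src;
        let lo = self.pos;
        let rest = &src[lo..];
        let len = rest
            .find(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
            .unwrap_or(rest.len());
        if len == 0 {
            return None;
        }
        self.pos = lo + len;
        Some(Spanned { node: &rest[..len], span: Span { lo, hi: self.pos } })
    }
}

/// Render a caret-and-tilde marker under the span, for error messages.
fn underline(span: Span) -> String {
    let width = span.hi.saturating_sub(span.lo);
    format!("{}^{}", " ".repeat(span.lo), "~".repeat(width.saturating_sub(1)))
}
```

Carrying the span with each node is what lets a later check point at the exact offending tokens instead of at the whole macro invocation.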

    macro_rules! bail (
        ($cx:expr, $span:expr, $msg:expr) => ({
            $cx.span_err($span, $msg);
            return ::syntax::ext::base::DummyResult::any($span);
        })
    )

    macro_rules! bail_if (
        ($e:expr, $cx:expr, $span:expr, $msg:expr) => (
            if $e { bail!($cx, $span, $msg) }
        )
    )

    match (lhs.node, rhs.node) {
        (Pat(pat), Expr(expr)) => {
            bail_if!(!wildcards.is_empty(), cx, lhs.span,
                "ordinary patterns may not appear after \
                 wildcard tags");

The macro rejects this ordering to guarantee the semantics of in-order matching
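
Stripped of the compiler types, the ordering check looks roughly like this. A sketch only: `Lhs`, `Tag`, `tags` and `check_arm_order` are hypothetical, patterns are kept as strings, and a tag with no name is assumed to stand for a wildcard.

```rust
#[derive(Debug)]
struct Tag {
    /// `None` is assumed to mean a wildcard tag.
    name: Option<String>,
}

#[derive(Debug)]
enum Lhs {
    /// An ordinary Rust pattern, kept as source text in this sketch.
    Pat(String),
    /// One or more tag patterns sharing a right-hand side.
    Tags(Vec<Tag>),
}

/// Convenience constructor for a `Tags` arm.
fn tags(names: &[Option<&str>]) -> Lhs {
    Lhs::Tags(
        names
            .iter()
            .map(|n| Tag { name: n.map(|s| s.to_string()) })
            .collect(),
    )
}

/// Walk the arms in source order. Return `Err(i)` with the index of the
/// first ordinary pattern that appears after a wildcard tag.
fn check_arm_order(arms: &[Lhs]) -> Result<(), usize> {
    let mut seen_wildcard = false;
    for (i, lhs) in arms.iter().enumerate() {
        match lhs {
            Lhs::Pat(_) if seen_wildcard => return Err(i),
            Lhs::Pat(_) => {}
            Lhs::Tags(ts) => {
                if ts.iter().any(|t| t.name.is_none()) {
                    seen_wildcard = true;
                }
            }
        }
    }
    Ok(())
}
```

In the real macro the index would be a `Span`, so the error can point at the offending arm.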

    src/tree_builder/rules.rs:100:17: 100:48 error: ordinary patterns may not appear after wildcard tags
    src/tree_builder/rules.rs:100                 CharacterTokens(NotSplit, text) => SplitWhitespace(text),
                                                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    error: aborting due to previous error