By the end of part one of this series we had a rudimentary document object model for representing only a tiny subset of the universe of possible HTML documents. This installment, modelled after the second post in Matt Brubeck’s series, implements an equally modest HTML parser that will build instances of the model we defined last time.

The HTML Parser

Our parser, just like our model, directly mirrors the design of Matt’s implementation in Rust. The primary challenge is just one of translating between two languages. Open Matt’s post in another browser window to compare the Swift code below to the original Rust.

The first thing we need is a type to represent parsing state: the input string and the index of the next unprocessed character. Immediately this gets us into the differences between how Rust and Swift implement strings. Both languages need to handle the complexity of Unicode while being as efficient and safe as possible. In Rust our index was a uint that indexes into an &str slice created from the String . In Swift, we can index directly into the String, but our index type is a String.Index struct with a succesor() method that handles the complexity of stepping from character to character without accidentally stepping into the middle of one that has a multi-byte representation.

1 2 3 4 struct Parser { var pos: String.Index let input: String }

Notice that since the input string doesn’t change, we can use a let binding for it. The index, on the other hand, changes as we parse the document, which requires us to use var . Now that we have this type defined, we can implement a few foundational methods.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 extension Parser { // Read the current Character without consuming it. func nextCharacter() -> Character { return input[self.pos] } // Does the current input start with the given string? func startsWith(s: String) -> Bool { return self.input.substringFromIndex(self.pos).hasPrefix(s) } // Return true if all input is consumed. func eof() -> Bool { return self.pos >= self.input.endIndex } // ... }

These methods look very much like the Rust equivalents. The most obvious difference is that the references to self are explicitly declared in Rust but are implicit in Swift.

Next we come to the place where the successor() method of String.Index allows us to safely advance to the next Unicode Character . This is the first method that mutates parsing state. Notice how the &mut self in the Rust implementation is replaced by the mutating keyword in Swift.

1 2 3 4 5 6 // Return the current Character, and advance self.pos to the next Character. mutating func consumeCharacter() -> Character { let result = input[self.pos] self.pos = self.pos.successor() return result }

The next method is the heart of the parser. It consumes characters one by one for as long as the predicate function evaluates to true .

1 2 3 4 5 6 7 8 // Consume Character until `test` returns false. mutating func consumeWhile(test: Character -> Bool) -> String { var result = "" while !self.eof() && test(self.nextCharacter()) { result.append(consumeCharacter()) } return result }

At this point in Matt’s post, we would be defining a pair of methods using the one above. Both of these methods, however, want to pass predicates that test character set membership. This is something that required a little gymnastics due to an apparent impedence mismatch between the native Swift Character type and the NSCharacterSet class, whose characterIsMember method requires a unichar . I was unable to find an easy way to convert one to another directly. Fortunately, NSString was a useful stepping stone between the two.

1 2 3 4 5 6 7 8 extension Character { func isMemberOf(set: NSCharacterSet) -> Bool { let bridgedCharacter = (String(self) as NSString).characterAtIndex(0) return set.characterIsMember(bridgedCharacter) } }

Using this workaround we can implement the desired methods:

1 2 3 4 5 6 7 8 9 // Consume and discard zero or more whitespace Character. mutating func consumeWhitespace() { self.consumeWhile( {$0.isMemberOf(NSCharacterSet.whitespaceCharacterSet()) }) } // Parse a tag or attribute name. mutating func parseTagName() -> String { return self.consumeWhile( {$0.isMemberOf(NSCharacterSet.alphanumericCharacterSet()) }) }

Now we finally get to finally write some functions that produce nodes in our object model.

1 2 3 4 5 6 7 8 9 10 11 12 // Parse a single node. mutating func parseNode() -> Node { switch self.nextCharacter() { case "<": return self.parseElement() case _: return self.parseText() } } // Parse a text node. mutating func parseText() -> Node { return Node(data: self.consumeWhile({$0 != "<" })) }

The function that produces a Node that represents an HTML element looks like the Rust equivalent. I like the assertions, so I kept them even though I worry that this code will fail to consume the characters in release builds. I have yet to test that. (Edit: this does turn out to be a problem. Thanks to /u/teequ for confirming that.)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 // Parse a single element, including its open tag, contents, and closing tag. mutating func parseElement() -> Node { // Opening tag. assert(self.consumeCharacter() == "<") let tagName = self.parseTagName() let attrs = self.parseAttributes() assert(self.consumeCharacter() == ">") // Contents. let children = self.parseNodes() // Closing tag. assert(self.consumeCharacter() == "<") assert(self.consumeCharacter() == "/") assert(self.parseTagName() == tagName) assert(self.consumeCharacter() == ">") return Node(name: tagName, attrs: attrs, children: children) }

This is also a possible concern in two of the three methods for parsing HTML element attributes and values.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 // Parse a single name="value" pair. mutating func parseAttr() -> (String, String) { let name = self.parseTagName() assert(self.consumeCharacter() == "=") let value = self.parseAttrValue() return (name, value) } // Parse a quoted value. mutating func parseAttrValue() -> String { let openQuote = self.consumeCharacter() assert(openQuote == "\"" || openQuote == "'") let value = self.consumeWhile( {$0 != openQuote} ) assert(self.consumeCharacter() == openQuote) return value } // Parse a list of name="value" pairs, separated by whitespace. mutating func parseAttributes() -> AttrMap { var attributes: AttrMap = [:] while (true) { self.consumeWhitespace() if self.nextCharacter() == ">" { break } let (name, value) = self.parseAttr() attributes[name] = value } return attributes }

Before we can write the top-level method that parses an entire document, we need the method that parses a set of sibling nodes.

1 2 3 4 5 6 7 8 9 10 11 12 // Parse a sequence of sibling nodes. mutating func parseNodes() -> [Node] { var nodes: [Node] = [] while (true) { self.consumeWhitespace() if self.eof() || self.startsWith("</") { break } nodes.append(self.parseNode()) } return nodes }

Using the above, we can write the final method to parse a document.

1 2 3 4 5 6 7 8 9 10 11 // Parse an HTML document and return the root element. public func parse(source: String) -> Node { var parser = Parser(pos: source.startIndex, input: source) let nodes = parser.parseNodes() // If the document contains a root element, just return it. Otherwise, create one. if nodes.count == 1 { return nodes[0] } else { return Node(name: "html", attrs: [:], children: nodes) } }

The Crow project also contains some unit tests to exercise this, which means we have finally arrived at a place where we can pull the trigger and see something happen.

If you have any code review to offer, please leave a comment in /r/swift.