Instrumenting Total Parsers Written in agdarsec

A common request I have received from (potential) users of my library of total parser combinators is to add support for error reporting. This is the story of adding such a machinery with minimal changes to the combinators themselves.

# The (Idealized) Type of Parsers

The type of total parser combinators in agdarsec is essentially a fancy version of the classic Haskell type Parser a = String -> [(String, a)] . The added complexity has two goals: to enforce totality and to give the user more configuration options.

# Parameters of the Parser.

If the support for totality is baked in the representation [1], the configuration options are granted by a set of parameters.

record Parameters : Set₁ where field Tok : Set Toks : ℕ → Set M : Set → Set

These parameters allow the user to specify a type of tokens Tok the parser is working on (for instance Char ), the type of the sized input Toks (typically something like Vec Tok ) and a monad M . The fact that the input is sized is essential for proving termination but is largely irrelevant from the end-user's point of view. In effect the parser type is morally equivalent to Parser a = Toks → M (a, Toks) where the size of the Toks in the output is smaller than that of the one in the input.

These definitions are already enough to define a wealth of parser combinators ranging from conjunction, disjunction, left or right associated chains, etc. all relying on the properties of M (e.g. Alternative M gives us the ability to defined the disjunction combinator) to realise the expected behaviour. For unambiguous grammars a simple Maybe -based solution worsk really well.

However by using Maybe as our monad, whenever a parse fails the user does not get any information about the failure. It makes debugging the parser (or the expression being parsed) quite hard. This is where instrumentation comes in.

# Instrumenting a Parser

Most of the instrumentation can be done by picking a more appropriate stack of monad transformers for M . Keeping track of the position of the parser in the original source file is the one thing that needs to be hard-wired by extending the set of Parameters of a Parser . We add one new field: recordToken which is to be used every time a token is read from the input.

record Parameters : Set₁ where field (...) recordToken : Tok → M ⊤

The only change to the combinators in the library is in anyToken , the function which grabs the first token on the input (if it exists). It now performs an extra operation before returning the grabbed value: it uses our new recordToken to remember the what was obtained.

# Position Tracking

Now that we record every token read from the input, we can start tracking the current position in the source file. This can be used to associate a position to each nodes of the Abstract Syntax Tree our parser produces which is useful if we are for instance writing a compiler and want to be able to report precise errors to users further down the line. We model positions with two natural numbers corresponding to the line and character offset respectively.

record Position : Set where field line : ℕ offset : ℕ

Each character yields an action on positions: if it is a newline character then we increment the line count and reset the offset one; otherwise we simply increment the offset.

next : Char → Position → Position next c p = if c == '

' then record { line = suc (line p) ; offset = 0 } else record p { offset = suc (offset p) }

By using a StateT Position transformer in our monad stack and making sure that recordToken updates the state by calling next , we now have extended character-based parsers to be position-aware.

Not only can this be used to produce ASTs recording the source file positions corresponding to each of their nodes, it also allows us to report parsing errors more accurately by returning the position at the point of failure.

# Distinguishing Soft and Hard Failures

Simply using StateT Position (Position ⊎_) as our monad does not quite yield the behaviour we expect. If a parser cannot succeed on a source file, it will always return as its position at the time of failure the one corresponding to the failure in the rightmost path in our grammar. Indeed at each disjunction, a failure in the left branch is recovered from by trying the right branch. However this is often not what we want.

For instance if we were to parse numbers written in either binary or hexadecimal format, we could describe the parser as a disjunction between a parser for strings of the form 0b[01]+ and one for strings of the form 0x[0-9A-F]+ . For the input "0b0000002" the parser would:

take the first branch and fail upon finding '2'

backtrack to the disjunction

take the second branch and fail upon finding 'b'

finish with an error reporting failure at the position of 'b'

Ideally we would like to bypass completely the second branch because having read "0b", we know that it can only fail. And we would like the error location to be that of '2', not of 'b' because we feel like it's a fairer assessment of what the problem is.

In other words: after having read a disambiguating token we sometimes want to commit to one branch of the disjunction, ignoring the alternatives. We can do this by distinguishing soft failures (from which we can recover) from hard failures (which are fatal).

data Result (A : Set) : Set where HardFail : Error → Result A SoftFail : Error → Result A Value : A → Result A

The Alternative instance for Result is similar to that of A ⊎_ : if the first computation has failed then we try the second one. The only difference here is that we only try the other branch if we are facing a soft failure. Otherwise we propagate the error.

_∣_ : ∀ {A} → Result A → Result A → Result A ma₁ ∣ ma₂ = case ma₁ of λ where (SoftFail _) → ma₂ r → r

This distinction between soft and hard failures gives us the ability to commit to one branch by turning soft failures into hard ones.

commit : ∀ {A} → Result A → Result A commit ma = case ma of λ where (SoftFail e) → HardFail e r → r

This Result type, the associated alternative instance and the use of commit bring two benefits to the table:

the parser finishes faster when it fails because it ignores all the alternative branches

the parser returns a more relevant error to the end-user rather than the one found in the last of the alternatives

It is important to note that we get all of this without having to modify the parser combinators; the setup is generic enough that end-users can drastically alter a parser's behaviour and the amount of information it returns simply by switching to a different monad stack. We can still think of alternatives such as trying the second branch of the disjunction and in case it also fails use a heuristics to decide which error to propagate.

We have seen how to add instrumentation to parsers written using total parser combinators by changing the one combinator which pulls tokens directly from the input and picking an appropriate monad stack. You can see these design decisions in use in the STLC parser in agdarky, a project bringing together some of my libraries to write software entirely in Agda.

This work would not have been possible without discussions with @clayrat on the design, implementation and testing of the instrumentation of agdarsec and their port of the changes to the Idris version of the library (idris-tparsec).

Last update: 2019 03