
ASDL Implementation and Use Cases

Yesterday I recapped the project priorities, then showed an example ASDL schema and data structure.

Today I'll describe oil's ASDL implementation, then explain the role it will play in each of the three priorities.

ASDL Implementation

I think ASDL is a big win so far, but the implementation is quite short:

ASDL
   433 asdl/asdl_.py
   271 asdl/encode.py
   309 asdl/py_meta.py
   313 asdl/format.py
   471 asdl/gen_cpp.py
  1797 total

I copied asdl_.py from Python's ASDL implementation. It uses a simple regex-based lexer and a recursive descent parser to turn the input schema (e.g. osh.asdl) into a tree of Python objects. That is, it produces the AST for the ASDL schema itself.
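To make the lexer/parser combination concrete, here is a minimal sketch of the same technique. It is not the code in asdl_.py: the grammar is a tiny ASDL-like subset I made up for illustration, and the names `TOKEN_RE`, `tokenize`, `Parser`, and `parse` are all hypothetical.

```python
import re

# Hypothetical grammar for a tiny ASDL-like subset (not the real asdl_.py grammar):
#   type_def    := NAME '=' constructor ('|' constructor)*
#   constructor := NAME ['(' field (',' field)* ')']
#   field       := NAME ['*' | '?'] NAME
TOKEN_RE = re.compile(r'\s*(?:([A-Za-z_][A-Za-z0-9_]*)|([=|(),*?]))')


def tokenize(text):
    """Regex-based lexer: return a list of ('name', s) and ('op', s) tokens."""
    text = text.rstrip()
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError('Unexpected character at offset %d' % pos)
        name, op = m.groups()
        tokens.append(('name', name) if name else ('op', op))
        pos = m.end()
    return tokens


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else ('eof', '')

    def at(self, op):
        return self.peek() == ('op', op)

    def eat(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError('Expected %s %r, got %r' % (kind, value, tok))
        self.i += 1
        return tok[1]

    def parse_type_def(self):
        name = self.eat('name')
        self.eat('op', '=')
        constructors = [self.parse_constructor()]
        while self.at('|'):
            self.eat('op', '|')
            constructors.append(self.parse_constructor())
        if self.peek()[0] != 'eof':
            raise SyntaxError('Trailing input: %r' % (self.peek(),))
        return {'name': name, 'constructors': constructors}

    def parse_constructor(self):
        name = self.eat('name')
        fields = []
        if self.at('('):
            self.eat('op', '(')
            fields.append(self.parse_field())
            while self.at(','):
                self.eat('op', ',')
                fields.append(self.parse_field())
            self.eat('op', ')')
        return {'name': name, 'fields': fields}

    def parse_field(self):
        type_name = self.eat('name')
        modifier = ''
        if self.at('*') or self.at('?'):
            modifier = self.eat('op')
        return {'type': type_name, 'modifier': modifier, 'name': self.eat('name')}


def parse(text):
    """Turn one type definition into a tree of plain Python dicts and lists."""
    return Parser(tokenize(text)).parse_type_def()
```

Calling `parse('expr = Num(int n) | Add(expr left, expr right)')` returns a nested dict describing the sum type `expr` and its two constructors, which is the kind of "AST for the schema" the other algorithms would walk.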

The other four files implement four recursive algorithms. Two of them walk the AST:

py_meta.py dynamically generates classes from the schema using metaprogramming, e.g. Python's type function. It creates a class for each product type and sum type (e.g. the token and word types in yesterday's example).

gen_cpp.py generates C++ code that can read a binary encoding of ASTs, called oheap. For example, the 129 lines in osh.asdl produce ~1100 lines of C++.
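The metaprogramming idea behind py_meta.py can be sketched with the three-argument form of Python's `type` function. This is only an illustration under my own assumptions, not the actual py_meta.py code: the helper names `make_class` and `make_sum_type`, and the field names `id`, `val`, and `parts`, are hypothetical.

```python
def make_class(name, field_names, base=object):
    """Create a class at runtime for a type with the given fields."""
    field_names = tuple(field_names)

    def __init__(self, *args):
        if len(args) != len(field_names):
            raise TypeError('%s expects %d fields, got %d'
                            % (name, len(field_names), len(args)))
        for field, value in zip(field_names, args):
            setattr(self, field, value)

    def __repr__(self):
        parts = ', '.join('%s=%r' % (f, getattr(self, f)) for f in field_names)
        return '%s(%s)' % (name, parts)

    # type(name, bases, namespace) builds the class object dynamically.
    namespace = {'__init__': __init__, '__repr__': __repr__, 'FIELDS': field_names}
    return type(name, (base,), namespace)


def make_sum_type(sum_name, variants):
    """Create a base class for a sum type and one subclass per variant."""
    base = type(sum_name, (object,), {})
    classes = {}
    for variant_name, field_names in variants.items():
        classes[variant_name] = make_class(variant_name, field_names, base)
    return base, classes


# Hypothetical schema entries, loosely modeled on a token and a word type.
token = make_class('token', ['id', 'val'])
word, word_variants = make_sum_type(
    'word', {'TokenWord': ['token'], 'CompoundWord': ['parts']})

tok = token('Lit_Chars', 'echo')
w = word_variants['TokenWord'](tok)
print(tok)                   # token(id='Lit_Chars', val='echo')
print(isinstance(w, word))   # True
```

Making each variant a subclass of the sum type's base class is one possible design choice; it lets `isinstance` answer "is this any kind of word?" without listing the variants.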

The other two algorithms operate on both the schema and the data, e.g. an oil AST for a particular program:

encode.py walks an instance of an AST in Python and encodes it in the oheap format.

format.py is similar, but instead of binary data, it produces the text format that we saw at the end of yesterday's post.
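As a rough illustration of an algorithm that consults both the schema and the data, here is a small text formatter. It is a sketch under my own assumptions rather than the format.py code: the `Node` class, the `SCHEMA` dict, and the parenthesized output syntax are invented for this example and are not oil's actual text format.

```python
class Node:
    """A generic AST node: a type name plus a dict of field values."""

    def __init__(self, type_name, **fields):
        self.type_name = type_name
        self.fields = fields


# Hypothetical schema: type name -> ordered list of field names.
SCHEMA = {
    'command': ['words'],
    'word': ['parts'],
    'token': ['id', 'val'],
}


def format_node(node, schema):
    """Recursively render a node as text, using the schema for field order."""
    if isinstance(node, list):
        return '[' + ' '.join(format_node(item, schema) for item in node) + ']'
    if not isinstance(node, Node):
        return str(node)  # primitive value such as a string or int
    parts = [node.type_name]
    for field in schema[node.type_name]:  # the schema drives the traversal
        parts.append('%s:%s' % (field, format_node(node.fields[field], schema)))
    return '(' + ' '.join(parts) + ')'


tree = Node('command', words=[
    Node('word', parts=[Node('token', id='Lit_Chars', val='echo')]),
    Node('word', parts=[Node('token', id='Lit_Chars', val='hi')]),
])
print(format_node(tree, SCHEMA))
```

A binary encoder would have the same recursive shape, emitting bytes for each field instead of text.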

I'll discuss oheap in a subsequent post. For now, think of it as a binary version of the text representation.

So it's possible to write a shell interpreter with a front end in Python and a back end in C++ like this:

[Lines of Code]
  -> Lexer (osh/lex.py) ->
[Tokens]
  -> Four Parsers (osh/*_parse.py) ->
[Tree of Python ASDL objects]
  -> Encoder (asdl/encode.py) ->
[oheap encoding of the tree]
  -> Decoder (generated osh.asdl.h) ->
Runtime in C++

(Above, data is surrounded by brackets and code is surrounded by arrows.)

But oil will likely use another stage, as discussed below.

Multiple Tree Representations

All shells I've looked at (including bash, dash, zsh, mksh) are tree interpreters. That is, after the parser builds the AST, the executor just traverses it and makes system calls like fork(), exec(), pipe(), and dup(). In contrast, Python is a bytecode interpreter.
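For readers who haven't seen one, a tree interpreter can be sketched in a few lines. This toy is not how any of those shells is written: the tuple-based node shapes (`'simple'`, `'and'`, `'or'`, `'seq'`) are hypothetical, and a stub `run` callable stands in for the fork() and exec() calls a real shell would make.

```python
def execute(node, run):
    """Walk the tree directly and return an exit status (0 means success)."""
    kind = node[0]
    if kind == 'simple':            # ('simple', argv)
        return run(node[1])
    if kind == 'and':               # ('and', left, right), like: left && right
        status = execute(node[1], run)
        return execute(node[2], run) if status == 0 else status
    if kind == 'or':                # ('or', left, right), like: left || right
        status = execute(node[1], run)
        return status if status == 0 else execute(node[2], run)
    if kind == 'seq':               # ('seq', [child, ...]), like: a; b; c
        status = 0
        for child in node[1]:
            status = execute(child, run)
        return status
    raise ValueError('Unknown node kind: %r' % (kind,))


def make_fake_runner(log):
    """Stub for process creation: records argv and fails only for 'false'."""
    def run(argv):
        log.append(argv)
        return 1 if argv[0] == 'false' else 0
    return run


log = []
tree = ('and', ('simple', ['false']), ('simple', ['echo', 'hi']))
status = execute(tree, make_fake_runner(log))
print(status, log)   # 1 [['false']]
```

There is no intermediate instruction stream here: control flow like `&&` falls out of how the executor recurses over the tree, which is the contrast with a bytecode interpreter.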

If I were just writing a shell, I would write a tree interpreter as well. But oil is an interpreter with two languages: the compatible osh language and the new oil language.

And now that I'm closer to tackling this problem, it looks more complex. I naively thought that I could just compile both languages to the same AST, and use that same AST to convert shell source to oil source.

I think I'll need multiple representations of the code, for two reasons:

Because ASTs are lossy. The whole point is to abstract away some details of the source code, but for source conversion we want a lossless representation of the code. (This is not quite a "parse tree", because it's been simplified even though it's lossless.)

Because oil is a superset of the shell language. For example, it's dynamically typed rather than "unityped". It doesn't make sense to use the same AST for two different languages, even if one is inspired by the other.

I'll write more about this when I've actually implemented it, because I'm sure my thoughts will evolve.

For example, I wrote this post before I implemented ASDL and used it to describe osh. You can see that I understood that the three use cases would impose different requirements on "the AST". But I didn't yet realize that oil needs multiple tree representations.

Luckily, ASDL is perfect for this, and it's no accident. Compilers written in ML are typically composed of many small stages connected by typed trees (algebraic data types). The types enforce properties of the code at each stage.

Immediately after integrating ASDL, it caught a lot of bugs in oil, which I quickly fixed. Surprisingly, dynamic type checking plus a good test suite is a decent replacement for static type checking, at least for our prototype.
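To show what dynamic type checking of this sort can look like, here is a small sketch. It is an assumption about the general technique, not a description of oil's generated classes: `make_checked_class` and the `token` fields are hypothetical names.

```python
def make_checked_class(name, fields):
    """Create a class whose fields are validated every time they are set.

    fields is a list of (field_name, expected_python_type) pairs.
    """
    field_types = dict(fields)

    def __setattr__(self, attr, value):
        if attr not in field_types:
            raise AttributeError('%s has no field %r' % (name, attr))
        expected = field_types[attr]
        if not isinstance(value, expected):
            raise TypeError('%s.%s expects %s, got %s'
                            % (name, attr, expected.__name__, type(value).__name__))
        object.__setattr__(self, attr, value)

    def __init__(self, **kwargs):
        missing = set(field_types) - set(kwargs)
        if missing:
            raise TypeError('%s is missing fields: %s' % (name, sorted(missing)))
        for attr, value in kwargs.items():
            setattr(self, attr, value)  # routed through the checked __setattr__

    return type(name, (object,), {'__init__': __init__, '__setattr__': __setattr__})


# Hypothetical product type with two string fields.
token = make_checked_class('token', [('id', str), ('val', str)])

ok = token(id='Lit_Chars', val='echo')   # passes the checks
try:
    token(id=42, val='echo')             # wrong type for 'id'
except TypeError as e:
    print(e)                             # token.id expects str, got int
```

A misspelled field or a wrongly typed value then fails at the point of construction or assignment, so a test suite that exercises the parser surfaces the bug immediately rather than somewhere downstream.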

(Update on MyPy: Type checking with MyPy appears to be a lost cause, since ASDL uses extensive metaprogramming. But I wanted ML's type system and not MyPy's Java-like type system.)

I plan to use three AST schemas:

osh.asdl: A lossless representation of shell source, already written.

oil.asdl: A lossless representation of oil source, not yet written.

ovm.asdl: A simpler language/tree that both osh and oil compile to, which is easily executed by C++.

Tackling the Top Priorities

Again, the top three project priorities are:

Test the oil language design by converting shell scripts in the wild to it.

Fill out the shell executor and runtime in Python.

Write a production quality executor and runtime in C++.

ASDL will help with each of these tasks:

As mentioned, osh.asdl will be a lossless representation; that is, we should be able to reconstruct the shell source from it. So we can also use it to convert any shell script to oil.

The main blocker is to clean up the source location schema, so we can preserve whitespace and comments. Right now I'm attaching a line_span object to each token, but this information should be propagated up to the node level.

The Python runtime has substantial functionality and can be easily extended. I hope to get help on this, but I need to get the spec tests in better shape before accepting contributions. Let me know in the comments if you're interested in contributing, and I'll increase the priority of this task!

The C++ executor is blocked on designing ovm.asdl, which is itself blocked on validating the oil language design. To design ovm, we need to have a good grasp on the languages that compile to it.

So this is further off, but I'm pleased to have implemented the oheap format, which makes progress toward this goal.

Conclusion

We talked about four tree-walking algorithms in oil's ASDL implementation, and then explained how ASDL helps with the top three priorities. And I suspect that we will have three ASDL schemas: osh, oil, and ovm.

I have one more post on the ASDL implementation, and then I'll share some thoughts that came from the experience of implementing ASDL and using it in oil.

For one, I'm surprised that ASDL seems to be uncommon. It's used in one of the most popular languages in the world, but there are only a couple of blog posts about it, and not too many modern successors.

Not to jump forward too much, but I want oil to be a shell for distributed computing, and schema languages are widely used in that domain. So this lesson from oil's implementation may inform the design of the oil language itself.