A taste of Rust

This article brought to you by LWN subscribers Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

Rust, the new programming language being developed by the Mozilla project, has a number of interesting features. One that stands out is the focus on safety. There are clear attempts to increase the range of errors that the compiler can detect and prevent, and thereby reduce the number of errors that end up in production code.

There are two general ways that a language design can help increase safety: it can make it easier to write good code and it can make it harder to write bad code. The former approach has seen much success over the years as structured programming, richer type systems, and improved encapsulation have made it easier for a programmer to express their goals using high-level concepts. The latter approach has not met with quite such wide success, largely because it requires removing functionality that easily leads to bad code. The problem there, of course, is that potentially dangerous functionality can also be extremely useful. The classic example is the goto statement, which has long been known to be an easy source of errors, but which most major procedural languages still include as they would quickly hit difficulties without it (though Rust, like Python, has avoided adding goto so far).

Nevertheless it is this second approach that Rust appears to focus on. A number of features that are present in languages like C, and some that are still present in more safety-conscious languages like Java, are disallowed or seriously restricted in Rust. To compensate, Rust endeavors to be a sufficiently rich language that those features are not necessary. One way to look at this focus is that it endeavors to produce compile-time errors where other languages would generate runtime errors (or core dumps). This not only reduces costs at runtime, but should decrease the number of support calls from customers.

A large part of this richness is embodied in the type system. Rust has many of the same sorts of types that other languages do (structures, arrays, pointers, etc.), but they are often, as we shall see, somewhat narrower or more restricted than is common. Coupled with this are various ways to broaden a type, in a controlled way, so that its shape fits more closely to what the programmer really wants to say. And for those cases where Rust turns out not to be sufficiently rich, there is a loophole.

Rigidly defined areas of doubt and uncertainty

In a move that strongly echos the "System" module in Modula-2 (which provides access to so-called "low-level" functionality), Rust allows functions and code sections to be declared as "unsafe". Many of the restrictions that serve to make Rust a safer language do not apply inside "unsafe" code.

This could be viewed as an admission of failure to provide a truly safe language, however that view would be overly simplistic. In the first place, Gödel's incompleteness theorem seems to suggest that any programming language which is rich enough to say all that we might want to say is also so rich that it cannot possibly be considered to be "safe". But on a more practical level, a moment's thought will show that code inside the language compiler itself is deeply "unsafe" in the context of the program that it generates. A bug there could easily cause incorrect behavior without any hope of a compiler being able to detect it. So complete safety is impossible.

Given that the reality of "unsafe" code is unavoidable, it is not unreasonable to allow some of it to appear in the target program, or in libraries written in the target language, instead of all of it being in the compiler. This choice to have the language be safe by default allows most code to be safe while providing clearly defined areas where unchecked code is allowed. This, at the very least, makes it easier to identify which areas of code need particular attention in code reviews.

If we look at the Rust compiler, which itself is written in Rust, we find that of the 111 source code files (i.e. the src/librustc directory tree, which excludes the library and runtime support), 33 of them contain the word "unsafe". 30% of the code being potentially unsafe may seem like a lot, even for a compiler, but 70% being guaranteed not to contain certain classes of errors certainly sounds like a good thing.

Safer defaults

As we turn to look at the particular ways in which Rust encourages safe code, the first is not so much a restriction as a smaller scale version of the choice to default to increased safety.

While many languages have keywords like "const" (C, Go) or "final" (Java) to indicate that a name, once bound, will always refer to the same value, Rust takes the reverse approach. Bindings are immutable by default and require the keyword "mut" to make them mutable, thus:

let pi = 3.1415926535;

pi

let mut radius = 1.496e11;

radius

mut

will never allow the value ofto change while:will allowto be altered later. The choice of, which encourages the vowel to sound short, in contrast to the long vowel in "mutable", may be unfortunate, but encouraging the programmer to think before allowing something to change is probably very wise.

For some data structures, the "mutable" setting is inherited through references and structures to member elements, so that if an immutable reference is held to certain objects, the compiler will ensure that the object as a whole will remain unchanged.

Pointers are never NULL

Pointer errors are a common class of errors in code written in C and similar languages, and many successors have taken steps to control these by limiting or forbidding pointer arithmetic. Rust goes a step further and forbids NULL (or nil) pointers as well, so that any attempt to dereference a pointer will result in finding a valid object. The only time that a variable or field does not contain a valid pointer is when the compiler knows that it does not, so a dereference will result in a compile-time error.

There are of course many cases where "valid pointer or NULL" is a very useful type, such as in a linked list or binary tree data structure. For these cases, the Rust core library provides the "Option" parameterized type.

If T is any type, then Option<T> is a variant record (sometime called a "tagged union" but called an "enum" in Rust), which contains a value of type T associated with the tag "Some", or it contains just the tag "None". Rust provides a "match" statement, which is a generalization of "case" or "typecase" from other languages, that can be used to test the tag and access the contained value if it is present.

Option is defined as:

pub enum Option<T> { None, Some(T), }

struct element { value: int, next: Option<~element>, }

element

~

@

so:could be used to make a linked list of integers. The tilde in front ofindicates a pointer to the structure, similar to an asterisk in C. There are different types of pointers which will be discussed later — tilde () introduces an "owned" pointer while the at sign () introduces a "managed" pointer.

Thus null-able pointers are quite possible in Rust, but the default is the safer option of a pointer that can never be NULL, and when a null-able pointer is used, it must always be explicitly tested for NULL before that use. For example:

match next { Some(e) => io::println(fmt!("%d", next.value)), None => io::println("No value here"), }

In contrast to this, array (or, as Rust calls them, "vector") references do not require the compiler to know that the index is in range before that index is dereferenced. It would presumably be possible to use the type system to ensure that indexes are in range in many cases, and to require explicit tests for cases where the type system cannot be sure — similar to the approach taken for pointers.

Thus, while it is impossible to get a runtime error for a NULL pointer dereference, it is quite possible to get a runtime error for an array bounds error. It is not clear whether this is a deliberate omission, or whether it is something that might be changed later.

Parameterized Generics

Unsurprisingly, Rust does not permit "void" pointers or down casts (casts from a more general type, like void, to a more specific type). The use case that most commonly justifies these — the writing of generic data structures — is supported using parameterized types and parameterized functions. These are quite similar to Generic Classes in Java and Templates in C++. Generics in Rust have a simple syntax and any individual non-scalar type and any function or method can have type parameters.

A toy example might be:

struct Pair<t1,t2> { first: t1, second: t2, } fn first<t1:Copy,t2:Copy>(p: Pair<t1,t2>) -> t1 { return copy p.first; }

let foo = Pair { first:1, second:'2'};

foo

let foo:Pair<int, char> = Pair { first:1, second:'2'};

first(foo);

int

int

We can then declare a variable:noting that the type ofis not given explicitly, but instead is inferred from the initial value. The declaration could read:but that would be needlessly verbose. We can then calland the compiler will know that this returns an. In fact it will return a copy of an, not a reference to one.

Rust allows all type declarations (struct, array, enum), functions, and traits (somewhat like "interfaces" or "virtual classes") to be parameterized by types. This should mean that the compiler always knows enough about the type of every value to avoid the need for down casts.

Keep one's memories in order

The final safety measure from Rust that we will examine relates to memory allocation and particularly how that interacts with concurrency.

Like many modern languages, Rust does not allow the programmer to explicitly release (or "free" or "dispose" of) allocated memory. Rather, the language and runtime support take care of that. Two mechanisms are provided which complement the inevitable "stack" allocation of local variables. Both of these mechanisms explicitly relate to the multiprocessing model that is based on "tasks", which are somewhat like "goroutines" in Go, or Green threads. They are lightweight and can be cooperatively scheduled within one operating system thread, or across a small pool of such threads.

The first mechanism provides "managed" allocations which are subject to garbage collection. These will be destroyed (if a destructor is defined) and freed sometime after the last active reference is dropped. Managed allocations are allocated on a per-task heap and they can only ever be accessed by code running in that task. When a task exits, all allocations in that task's heap are guaranteed to have been freed. The current implementation uses reference counting to manage these allocations, though there appears to be an intention to change to a mark/sweep scheme later.

The second mechanism involves using the type system to ensure that certain values only ever have one (strong) reference. When that single reference is dropped, the object is freed. This is somewhat reminiscent of the "Once upon a type" optimization described by David Turner et al. for functional languages. However, that system automatically inferred which values only ever had a single reference, whereas Rust requires that this "once" or "owned" (as Rust calls it) annotation be explicit.

These single-reference allocations are made on a common heap and can be passed between tasks. As only one reference is allowed, it is not possible for two different tasks to both access one of these owned objects, so concurrent access, and the races that so often involves, are impossible.

May I borrow your reference?

Owned values can have extra references providing they are "borrowed" references. Borrowed references are quite weak and can only be held in the context of some specific strong reference. As soon as the primary reference is dropped or goes out-of-scope, any references borrowed from it become invalid and it is a compile-time error if they are used at all.

It is often convenient to pass borrowed references to functions and, more significantly, to return these references from functions. This is particularly useful as each of owned, managed, and on-stack references can equally be borrowed, so passing a borrowed reference can work independently of the style of allocation.

The passing of borrowed references is easily managed by annotating the type of the function parameter to say that the reference is a borrowed reference. The compiler will then ensure nothing is done with it that is not safe.

Returning borrowed references is a little trickier as the compiler needs to know not only that the returned reference is borrowed, but also needs to know something about the lifetime of the reference. Otherwise it cannot ensure that the returned reference isn't used after the primary reference has been dropped.

This lifetime information is passed around using the same mechanism as type-parameters for functions. A function can be declared with one or more lifetime parameters (introduced by a single quote: ' ), and the various arguments and return values can then be tagged with these lifetimes. Rust uses type inference on the arguments to determine the actual values of these parameters and thus the lifetime of the returned value. For example, the function choose_one declared as:

fn choose_one<'r>(first: &'r Foo, second: &'r Foo) -> &'r Foo

r

first

second

&

has one lifetime parameter () which applies to both arguments (and), and to the result. The ampersand () here indicates a borrowed reference.

When it is called, the lifetime parameter is implicit, not explicit:

baz = choose_one(bar, bat);

r

bar

bat

baz

The compiler knows that the lifetime ofmust be compatible with the lifetimes ofandso it effectively computes the intersection of those. The result is associated withas its lifetime. Any attempt to use it beyond its lifetime will cause a compile-time error.

Put another way, Rust uses the lifetime parameter information to infer the lifetime of the function's result, based on the function's arguments. This is similar to the inference of the type of foo in our Pair example earlier. A more thorough explanation of these lifetimes can be found in the "Borrowed Pointers Tutorial" referenced below.

This extra annotation may feel a little clumsy, but it seems a relatively small price to pay for the benefit of being able to give the compiler enough information that it can verify all references are valid.

Time for that loophole

Allowing objects to be allocated on the stack, in a per-task heap, or in a common heap providing there is only ever one (strong) reference ensures that there will be no errors due to races with multiple tasks accessing the same data, but the price seems a little high to pay — it seems to mean that there can be no sharing of data between tasks at all.

The key to solving this conundrum is the "unsafe" code regions that were mentioned earlier. Unsafe code can allocate "raw" memory which has no particular affiliation or management, and can cast "raw" pointers into any other sort of pointer.

This allows, for example, the creation of a "queue" from which two owned references could be created, one of a type that allows references to be added to the queue, one which allows references to be removed. The implementation of this queue would require a modest amount of "unsafe" code, but any client of the queue could be entirely safe. The end points might be passed around from task to task, can be stored in some data structure, and can be used to move other owned references between tasks. When the last reference is dropped the "queue" implementation will be told and can respond appropriately.

The standard library for Rust implements just such a data structure, which it calls a "pipe". Other more complex data structures could clearly be implemented to allow greater levels of sharing, while making sure the interface is composed only of owned and managed references, and thus is safe from unplanned concurrent access and from dangling pointer errors.

Is safety worth the cost?

Without writing a large body of code in Rust it is impossible to get a good feel for how effective the various safety features of the language really are. Having a general ideological preference for strong typing, I expect I would enjoy the extra expressiveness and would find the extra syntax it requires to be a small price to pay. The greatest irritation would probably come from finding the need to write too much (or even any) unsafe code, or from having to give up all safety when wanting to sacrifice any.

Of course, if that led to practical ideas for making the language able to make richer assertions about various data, or richer assertions about the level of safety, and could thus enlarge the scope for safe code and decreasing the need for unsafe code, then the irritation might well result in some useful pearls.

After all, Rust is only at release "0.6" and is not considered to be "finished" yet.

References