Little things that matter in language design

The designers of a new programming language are probably most interested in the big features — the things that just couldn't be done with whichever language they are trying to escape from. So they are probably thinking of the type system, the data model, the concurrency support, the approach to polymorphism, or whatever it is that they feel will affect the expressiveness of the language in the way they want.

There is a good chance they will also have a pet peeve about syntax, whether it relates to the exact meaning of the humble semicolon, or some abhorrent feature such as the C conditional expression which (they feel) should never be allowed to see the light of day again. However, designing a language requires more than just addressing the things you care about. It requires making a wide range of decisions concerning various sorts of abstractions, and making sure the choices all fit together into a coherent, and hopefully consistent, whole.

One might hope that, with over half a century of language development behind us, there would be some established norms which can be simply taken as "best practice" without further concern. While this is true to an extent, there appears to be plenty of room for languages to diverge even on apparently simple concepts.

Having begun an exploration of the relatively new languages Rust and Go and, in particular, having two languages to provide illuminating contrasts, it seems apropos to examine some of those language features that we might think should be uncontroversial to see just how uniform they have, or have not, become.

Comments

When first coming to C from Pascal, the usage of braces can be a bit of a surprise. While Pascal sees them as one option for enclosing comments, C sees them as a means of grouping statements. This harsh conflict between the languages is bound to cause confusion, or at least a little friction, when moving from one language to the next, but fortunately appears to be a thing of the past.

One last vestige of this sort of confusion can be seen in the configuration files for BIND, the Berkeley Internet Name Daemon. In the BIND configuration files semicolons are used as statement terminators while in the database files they introduce comments.

When not hampered by standards conformance as these database files are, many languages have settled on C-style block comments:

/* This is a comment */

and C++-style one-line comments:

// This line has a comment

these having won over from the other Pascal option of:

(* similar but different block comments *)

and Ada's:

-- again a similar yet different single line comment.

The other popular alternative is to start comments with a "#" character, which is a style championed by the C-shell and Bourne shell, and consequently used by many scripting languages. Thankfully the idea of starting a comment with "COMMENT" and ending with "TNEMMOC" never really took off and may be entirely apocryphal.

Both Rust and Go have embraced these trends, though not as fully as BIND configuration files and other languages like Crack, which allow all three (/* */, //, and #). Rust and Go only support the C and C++ styles.

Go doesn't use the "#" character at all, allowing it only inside comments and string constants, so it is available as a comment character for a future revision, or maybe for something else.

Rust has another use for "#" which is slightly reminiscent of its use by the preprocessor in C. The construct:

#[attribute....]

attaches arbitrary metadata to nearby parts of the program which can enable or disable compiler warnings, guide conditional compilation, specify a license, or any of various other things.

Identifiers

Identifiers are even more standard than comments. Any combination of letters, digits, and the underscore that does not start with a digit is usually acceptable as an identifier, providing it hasn't already been claimed as a reserved word (like "if" or "while").

With the increasing awareness of languages and writing systems other than English, UTF-8 is more broadly supported in programming languages these days. This extends the range of characters that can go into an identifier, though different languages extend it differently.

Unicode defines a category for every character, and Go simply extends the definition given above to allow "Unicode letter" (which has 5 sub-categories: uppercase, lowercase, titlecase, modifier, and other) and "Unicode decimal digit" (which is one of 3 sub-categories of "Number", the others being "Number,letter" and "Number,other") to be combined with the underscore. The Go FAQ suggests this definition may be extended depending on how standardization efforts progress.
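As a rough illustration of that definition, the check can be sketched with Go's standard unicode package, whose IsLetter and IsDigit functions test the "Letter" and "Number, decimal digit" categories. The function name isIdentifier is invented for this sketch, and reserved words are deliberately not checked.

```go
package main

import (
	"fmt"
	"unicode"
)

// isIdentifier reports whether s has the shape described above: a letter or
// underscore, followed by any mix of letters, decimal digits, and underscores.
// Reserved words are not rejected in this sketch.
func isIdentifier(s string) bool {
	if s == "" {
		return false
	}
	for i, r := range s {
		switch {
		case r == '_' || unicode.IsLetter(r):
			// allowed in any position
		case unicode.IsDigit(r) && i > 0:
			// digits are allowed anywhere except the first position
		default:
			return false
		}
	}
	return true
}

func main() {
	for _, s := range []string{"count", "_tmp1", "1st", "名前", "a-b"} {
		fmt.Printf("%q -> %v\n", s, isIdentifier(s))
	}
}
```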

Rust gives a hint of what these efforts may look like by delegating the task of determining valid identifiers to the Unicode standard. The Unicode Standard Annex #31 defines two character classes, "ID_Start" and "ID_Continue", that can be used to form identifiers in a standard way. The Annex offers these as a resource, rather than imposing them as a standard, and acknowledges that particular use cases may extend them in various ways. It particularly highlights that some languages like to allow identifiers to start with an underscore, which ID_Start does not contain. The particular rule used by Rust is to allow an identifier to start with an ASCII letter, underscore, or any ID_Start, and to be continued with ASCII letters, ASCII digits, underscores, or Unicode ID_Continue characters.

Allowing Unicode can introduce interesting issues if case is significant, as Unicode supports three cases (upper, lower, and title) and also supports characters without case. Most programming languages very sensibly have no understanding of case and treat two characters of different case as different characters, with no attempt to fold case or have a canonical representation. Go however does pay some attention to case.

In Go, identifiers where the first character is an uppercase letter are treated differently in terms of visibility between packages. A name defined in one package is only exported to other packages if it starts with an uppercase letter. This suggests that writing systems without case, such as Chinese, cannot be used to name exported identifiers without some sort of non-Chinese uppercase prefix. The Go FAQ acknowledges this weakness but shows a strong reluctance to give up the significance of case in exports.
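The export rule is simple enough to sketch in a few lines. The helper name isExported is an assumption made for this illustration; it only inspects the first character, which is all the rule described above depends on.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isExported applies the rule described above: a name is visible to other
// packages only if its first character is an uppercase letter.
func isExported(name string) bool {
	first, _ := utf8.DecodeRuneInString(name)
	return unicode.IsUpper(first)
}

func main() {
	// Chinese characters have no case, so a bare Chinese name is never
	// exported; an uppercase prefix is one workaround.
	for _, name := range []string{"Open", "open", "名前", "X名前"} {
		fmt.Printf("%s exported: %v\n", name, isExported(name))
	}
}
```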

Numbers

Numbers don't face any new issues with Unicode though possibly that is just due to continued English parochialism, as Unicode does contain a complete set of Roman numerals as well as those from more current numeral systems. So you might think that numbers would be fairly well standardized by now. To a large extent they are, but there still seems to be wiggle room.

Numbers can be integers or, with a decimal point or exponent suffix (e.g. "1.0e10"), floating point. Integers can be expressed in decimal, octal with a leading "0", or hexadecimal with a leading "0x".

In C99 and D, floating point numbers can also be hexadecimal. The exponent suffix must then have a "p" rather than "e" and gives a power of two expressed in decimal. This allows precise specification of floating point numbers without any risk of conversion errors. D, and C compilers such as GCC as an extension, also allow a "0b" prefix on integers to indicate a binary representation (e.g. "0b101010"), and D allows underscores to be sprinkled through numbers to improve readability, so 1_000_000_000 is clearly the same value as 1e9.

Neither Rust nor Go has included hexadecimal floats. While Rust has included binary integers and the underscore spacing character, Go has left these out.

Another subtlety is that while C, D, Go, and many other languages allow a floating point number to start with a period (e.g. ".314159e1"), Rust does not. All numbers in Rust must start with a digit. There does not appear to be any syntactic ambiguity that would arise if a leading period were permitted, so this is presumably due to personal preference or accident.

In the language Virgil-III this choice is much clearer. Virgil has a fairly rich "tuple" concept [PDF] which provides a useful shorthand for a list of values. Members of a tuple can be accessed with a syntax similar to structure field references, only with a number rather than a name. So in:

var x:(int, int) = (3, 4);
var w:int = x.1;

The variable "w" is assigned the value "4" as it is element one of the tuple "x". Supporting this syntax while also allowing ".1" to be a floating point number would require the tokenizer to know when to report two tokens ("dot" and "int") and when it is just one ("float"). While possible, this would be clumsy.

Many fractional numbers (e.g. 0.75) will start with a zero even in languages which allow a leading period (.75). Unlike the case with integers, the leading zero does not mean these numbers are interpreted in base eight. For 0.75 this is unlikely to cause confusion. For 0777.0 it might. Best practice for programmers would be to avoid the unnecessary digit in these cases and it would be nice if the language required that.
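Go is one language where the two readings of a leading zero can be seen side by side. This is a small sketch; the constant names are invented for the illustration.

```go
package main

import "fmt"

const (
	octalInt   = 0777   // leading zero on an integer: read in base eight
	decimalFlt = 0777.0 // leading zero on a float: still read in base ten
)

func main() {
	fmt.Println(octalInt)   // prints 511
	fmt.Println(decimalFlt) // prints 777
}
```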

As well as prefixes, many languages allow suffixes on numbers with a couple of different meanings. Those few languages which have "complex" as a built-in type need a syntax for specifying "imaginary" constants. Go, like D, uses an "i" suffix. Python uses "j". Spreadsheets like LibreOffice localc or Microsoft Excel allow either "i" or "j". It is a pity more languages don't take that approach. Rust doesn't support native complex numbers, so it doesn't need to choose.
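For instance, Go's "i" suffix combines with the built-in real and imag functions. This is a minimal sketch; the helper name magnitude is an assumption, wrapping the standard math/cmplx package.

```go
package main

import (
	"fmt"
	"math/cmplx"
)

// magnitude returns the absolute value of a complex number.
func magnitude(z complex128) float64 {
	return cmplx.Abs(z)
}

func main() {
	z := 3 + 4i // "4i" is an imaginary constant, so z is a complex value
	fmt.Println(real(z), imag(z), magnitude(z))
}
```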

The other meaning of a suffix is to indicate the "size" of the value - how many bytes are expected to be used to store it. C and D allow u, l, ll, or f for unsigned, long, long long, and float, with a few combinations permitted. Rust allows u, u8, u16, u32, u64, i8, i16, i32, i64, f32, and f64, which cover much the same set of sizes but are more explicit. Perhaps fortunately, i is not a permitted suffix, so there is room to add imaginary numbers in the future if that turned out to be useful.

Go takes a completely different approach to the sizing of constants. The language specification talks about "untyped" constants though this seems to be some strange usage of the word "untyped" that I wasn't previously aware of. There are in fact "untyped integer" constants, "untyped floating point" constants, and even "untyped boolean" constants, which seem like they are untyped types. A more accurate term might be "unsized constants with unnamed types" though that is a little cumbersome.

These "untyped" constants have two particular properties. They are calculated using high precision with overflow forbidden, and they can be transparently converted to a different type provided that the exact value can be represented in the target type. So "1e15" is an untyped floating point constant which can be used where an int64 is expected, but not where an int32 is expected, as it requires 50 bits to store as an integer.
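The 1e15 case can be checked directly. In this sketch the names quadrillion, asInt64, and bitsNeeded are invented for the illustration; the commented-out line is the conversion that the compiler refuses.

```go
package main

import (
	"fmt"
	"math/bits"
)

const quadrillion = 1e15 // an untyped floating-point constant

// asInt64 relies on the implicit conversion: 1e15 is an exact integer that
// fits in 64 bits, so the constant can be returned as an int64.
func asInt64() int64 {
	return quadrillion
}

// bitsNeeded reports how many bits are needed to hold a non-negative v.
func bitsNeeded(v int64) int {
	return bits.Len64(uint64(v))
}

func main() {
	var wide int64 = quadrillion
	// var narrow int32 = quadrillion // rejected: the value does not fit in an int32
	fmt.Println(wide, bitsNeeded(wide)) // prints 1000000000000000 50
}
```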

The specification states that "Constant expressions are always evaluated exactly"; however, some edge cases are to be expected:

print((1 + 1/1e130)-1, "\n")
print(1/1e130, "\n")

results in:

+9.016581e-131
+1.000000e-130

so there does seem to be some limit to precision. Maintaining high precision and forbidding overflow means that there really is no need for size suffixes.

Strings

Everyone knows that strings are enclosed in single or double quotes. Or maybe backquotes ( ` ) or triple quotes ( ''' ). And that while they used to contain ASCII characters, UTF-8 is preferred these days. Except when it isn't, and UTF-16 or UTF-32 are needed.

Both Rust and Go, like C and others, use single quotes for characters and double quotes for strings, both with the standard set of escape sequences (though Rust inexplicably excludes \b, \v, \a, and \f). This set includes \uXXXX and \UXXXXXXXX so that all Unicode code-points can be expressed using pure ASCII program text.

Go chooses to refer to character constants as "Runes" and provides the built-in type "rune" to store them. In C and related languages "char" is used both for ASCII characters and 8-bit values. It appears that the Go developers wanted a clean break with that and do not provide a char type at all. rune (presumably more aesthetic than wchar) stores (32-bit) Unicode characters while byte or uint8 store 8-bit values.
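The difference between the two types shows up as soon as a string contains a non-ASCII character. In this small sketch the helper name counts is an assumption made for the example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of bytes and the number of runes in s.
func counts(s string) (nbytes, nrunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "h\u00e9llo" // "héllo"; the é occupies two bytes in UTF-8
	nb, nr := counts(s)
	fmt.Println(nb, nr) // prints 6 5

	var r rune = '\u00e9' // a rune holds one Unicode code point
	var b byte = s[1]     // a byte holds one 8-bit value
	fmt.Printf("%U 0x%02x\n", r, b) // prints U+00E9 0xc3
}
```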

Rust keeps the name char for 32-bit Unicode characters and introduces u8 for 8-bit values.

The modern trend seems to be to disallow literal newlines inside quoted strings, so that missing quote characters can be quickly detected by the compiler or interpreter. Go follows this trend and, like D, uses the back quote (rather than the Python triple-quote) to surround "raw" strings in which escapes are not recognized and newlines are permitted. Rust bucks the trend by allowing literal newlines in strings and does not provide for uninterpreted strings at all.
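Go's two string forms can be compared by length. This is a short sketch; the constant names are invented for the illustration.

```go
package main

import "fmt"

// interpreted uses double quotes: \n is an escape for a single newline byte.
const interpreted = "a\nb"

// raw uses back quotes: the backslash and the n are kept as two bytes.
const raw = `a\nb`

// multiline shows that a raw string may contain a literal newline.
const multiline = `first
second`

func main() {
	fmt.Println(len(interpreted), len(raw), len(multiline)) // prints 3 4 12
}
```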

Both Rust and Go assume UTF-8. They do not support the prefixes of C (U"this is a string of 32bit characters") or the suffixes of D ("another string of 32bit chars"d) to declare a string to be a multibyte string.

Semicolons and expressions

The phrase "missing semicolon" still brings back memories from first-year computer science and learning Pascal. It was a running joke that whenever the lecturer asked "What does this code fragment do?" someone would call out "missing semicolon", and they were right more often than you would think.

In Pascal, a semicolon separates statements, while in C it terminates some statements: if, for, while, switch, and compound statements do not require a semicolon. Neither rule is particularly difficult to get used to, but both often require semicolons at the end of lines that can look unnecessary.

Go follows Pascal in that semicolons separate statements: every pair of statements must be separated. A semicolon is not needed before the "}" at the end of a block, though it is permitted there. Go also follows the pattern seen in Python and JavaScript where the semicolon is sometimes assumed at the end of a line (when a newline character is seen). The details of this "sometimes" are quite different between languages.

In Go, the insertion of semicolons happens during "lexical analysis", which is the step of language processing that breaks the stream of characters into a stream of tokens (i.e. a tokenizer). If a newline is detected on a non-empty line and the last token on the line was one of:

an identifier,

one of the keywords break, continue, fallthrough, or return,

a numeric, rune, or string literal, or

one of ++, --, ), ], or }

then a semicolon is inserted at the location of the newline.

This imposes some style choices on the programmer such that:

if some_test
{
    some_statement
}

is not legal (the open brace must go on the same line as the condition), and:

a = c
  + d
  + e

is not legal: the operator (+) must go at the end of the first line, not the start of the second.
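The inserted semicolons can be observed with the tokenizer in Go's standard library, go/scanner. It reports an automatically inserted semicolon as a SEMICOLON token whose literal is a newline rather than ";". The function name insertedSemicolons is invented for this sketch, which only counts such tokens and does not parse anything.

```go
package main

import (
	"fmt"
	"go/scanner"
	"go/token"
)

// insertedSemicolons counts the semicolons that the scanner adds on its own
// while tokenizing src, as opposed to those written in the source.
func insertedSemicolons(src string) int {
	fset := token.NewFileSet()
	file := fset.AddFile("example.go", fset.Base(), len(src))

	var s scanner.Scanner
	s.Init(file, []byte(src), nil, 0)

	n := 0
	for {
		_, tok, lit := s.Scan()
		if tok == token.EOF {
			return n
		}
		if tok == token.SEMICOLON && lit == "\n" {
			n++
		}
	}
}

func main() {
	// Operators at the start of a line: every line ends in an identifier,
	// so every line gets a semicolon and the statement is split up.
	fmt.Println(insertedSemicolons("a = c\n+ d\n+ e\n")) // prints 3

	// Operators at the end of a line: only the last line gets one.
	fmt.Println(insertedSemicolons("a = c +\nd +\ne\n")) // prints 1

	// Brace on its own line: a semicolon lands between condition and block.
	fmt.Println(insertedSemicolons("if ok\n{\n}\n")) // prints 2
	fmt.Println(insertedSemicolons("if ok {\n}\n"))  // prints 1
}
```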

In contrast to this, JavaScript waits until the "parsing" step of language processing when the sequence of tokens is gathered into syntactic units (statements, expressions, etc.) following a context free grammar. JavaScript will insert a semicolon, provided that semicolon would serve to terminate a non-empty statement, if:

it finds a newline in a location that the grammar forbids a newline, such as after the word "break" or before the postfix operator "++";

it finds a "}" or End-of-file that is not expected by the grammar; or

it finds any token that is not expected, which was separated from the previous token by at least one newline.

This often works well but brings its own share of style choices including the interesting suggestion to sometimes use a semicolon to start a statement.

While both of these approaches are workable, neither really seems ideal. They both force style choices which are rather arbitrary and seem designed to make life easy for the compiler rather than for the programmer.

Rust takes a very different approach to semicolons than Go or JavaScript or many other languages. Rather than making them less important and often unnecessary, Rust makes them more important and gives them a significant semantic meaning.

One use involves the attributes mentioned earlier. When followed by a semicolon:

#[some_attribute];

the attribute applies to the entity (e.g. the function or module) that the attribute appears within. When not followed by a semicolon, the attribute applies to the entity that follows it. A missing semicolon could certainly make a big difference here.

The primary use of semicolons in Rust is much like that in C: they are used to terminate expressions by turning the expressions into statements, discarding any result. The effect is really quite different from C because of a related difference: many things that C considers to be statements, Rust considers to be expressions. A simple example is the if expression.

a = if b == c { 4 } else { 5 };

Here the if expression returns either "4" or "5", which is stored in "a".

A block, enclosed in braces ({ }), typically includes a sequence of expressions with semicolons separating them. If the last expression is also followed by a semicolon, then the block-expression as a whole does not have a value: that last semicolon discards the final value. If the last expression is not followed by a semicolon, then the value of the block is the value of the last expression.

If this completely summed up the use of semicolons it would produce some undesirable requirements.

if condition {
    expression1;
} else {
    expression2;
}
expression3;

This would not be permitted as there is no semicolon to discard the value of the if expression before expression3. Having a semicolon after the last closing brace would be ugly, and that if expression doesn't actually return a value anyway (both internal expressions are terminated with a semicolon) so the language does not require the ugly semicolon and the above is valid Rust code. If the internal expression did return a value, for example if the internal semicolons were missing, then a semicolon would be required before expression3.

Following this line of reasoning leads to an interesting result.

if condition {
    function1()
} else {
    function2()
}
expression3;

Is this code correct or is there a missing semicolon? To know the answer you need to know the types of the functions. If they do not return a value, then the code is correct. If they do, a semicolon is needed, either one at the end of the whole "if" expression, or one after each function call. So in Rust, we need to evaluate the types of expressions before we can be sure of correct semicolon usage in every case.

Now the above is probably just a silly example, and no one would ever write code like that, at least not deliberately. But the rules do seem to add an unnecessary complexity to the language, and the task of programming is complex enough as it is — adding more complexity through subtle language rules is not likely to help.

Possibly a bigger problem is that any tool that wishes to accurately analyze the syntax of a program needs to perform a complete type analysis. It is a known problem that the correct parsing of C code requires you to know which identifiers are typedefs and which are not. Rust isn't quite that bad as missing type information wouldn't lead to an incorrect parse, but at the very least it is a potential source of confusion.

Return

A final example of divergence on the little issues, though perhaps not quite so little as the others, can be found in returning values from functions using a return statement. Both Rust and Go support the traditional return and both allow multiple values to be returned: Go by simply allowing a list of return types, Rust through the "tuple" type which allows easy anonymous structures. Each language has its own variation on this theme.

If we look at the half million return statements in the Linux kernel, nearly 35,000 of them return a variable called "ret", "retval", "retn", or similar, and a further 20,000 return "err", "error", or similar. This totals more than 10% of total usage of return in the kernel. This suggests that there is often a need to declare a variable to hold the intended result of a function, rather than to just return a result as soon as it is known.

Go acknowledges this need by allowing the signature of a function to give names to the return values as well as the parameter values:

func open(filename string, flags int) (fd int, err int)

Here the (hypothetical) open() function returns two integers named fd (the file descriptor) and err. This provides useful documentation of the meaning of the return values (assuming programmers can be more creative than "retval") and also declares variables with the given names. These can be set whenever convenient in the code of the function and a simple:

return

with no expressions listed will use the values in those variables. Go requires that this return be present, even if it lists no values and is at the end of the function, which seems a little unnecessary, but isn't too burdensome.
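A toy version of that hypothetical open() shows named results and the bare return together. The file table and the error numbers below are invented for this sketch and stand in for whatever a real implementation would do.

```go
package main

import "fmt"

// open is a stand-in for the hypothetical function in the text. The results
// fd and err are ordinary variables that start out as zero.
func open(filename string, flags int) (fd int, err int) {
	table := map[string]int{"/etc/passwd": 3} // invented file table

	fd = -1
	known, ok := table[filename]
	if !ok {
		err = 2 // invented code standing in for "no such file"
		return  // bare return: hands back the current fd and err
	}
	fd = known
	return // still required, even at the end of the function
}

func main() {
	fmt.Println(open("/etc/passwd", 0)) // prints 3 0
	fmt.Println(open("/no/such", 0))    // prints -1 2
}
```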

There is evidence that some Go developers are not completely comfortable with this feature, though it isn't clear whether the feature itself is a problem, or rather the interplay with other features of Go.

Rust's variation on this theme we have already glimpsed with the observation that Rust has "expressions" in preference to "statements". The whole body of a function can be viewed as an expression and, provided it doesn't end with a semicolon, the value produced by that expression is the value returned from the function. The word return is not needed at all, though it is available and an explicit return expression within the function body will cause an early return with the given value.

Conclusion

There are many other little details, but this survey provides a good sampling of the many decisions that a language designer needs to make even after they have made the important decisions that shape the general utility of the language. There certainly are standards that are appearing and broadly being adhered to, such as for comments and identifiers, but it is a little disappointing that there is still such variability concerning the available representations of numbers and strings.

The story of semicolons and statement separation is clearly not a story we've heard the end of yet. While it is good to see language designers exploring the options, none of the approaches explored above seems entirely satisfactory. The recognition of a line-break as being distinct from other kinds of white space seems to be a clear recognition that the two dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don't see the line indent playing a bigger role in interpretation of code. The particular rules used by Python may not be to everyone's liking, but the principle of making use of this very obvious aspect of a program seems sound.

We cannot expect ever to converge on a single language that suits every programmer and every task, but the more uniformity we can find on the little details, the easier it will be for programmers to move from language to language and maximize their productivity.