Raw string literals -- where we are, how we got here

Now that things have largely stabilized with raw string literals, let me summarize where we are, and how we got here.

## The proposal

Where we are now is that a raw string literal consists of an opening delimiter, which is a sequence of N consecutive backticks for some N > 0; a body, which may contain any characters (including newlines) except for a sequence of N consecutive backticks; and a closing delimiter of N consecutive backticks. Any line-end sequences (CR, LF, CRLF) are normalized to a single newline (LF), and the remainder of the body is placed in a String without any further transformation (including without Unicode escape processing).

A raw string literal has type String, just like a traditional string literal, and can be used anywhere an expression of type String can be used (assignment, concatenation, etc.)

Examples:

    String s = `Doesn't have a \n newline character in it`;
    String ss = `a multi-
       line-string`;
    String sss = ``a string with a single tick (`) character in it``;
    String ssss = `a string with two ticks (``) in it`;
    String sssss = `````a string literal with gratuitously many ticks in its delimiter`````;

Note that the delimiter need not be _more_ ticks than the longest tick sequence in the body; if the body contains sequences of two ticks and three ticks, it can be delimited by one tick, four ticks, five ticks, etc. This makes it possible to choose a minimal delimiter that doesn't interfere with the body.

## Design Center

The design center for this feature is _raw string literals_. Not multi-line strings (though this is well handled), not interpolated strings (though this can be considered in the future.) The feature turns off all inline escaping, even Unicode escaping (which is usually handled by the lexer before the production even sees the characters.) We stay as true as we can to this principle: raw means raw, not 99% raw with a little bit of escaping. (The single exception is normalizing of carriage control, the absence of which would just be too surprising.)

The primary use case addressed by raw string literals is embedding snippets of code from other languages in Java source files. Here we interpret "languages" broadly; they could be traditional programming languages, specialized languages like regular expressions or SQL, or human languages. We want Java lexing not to interfere at all; given a suitable O(1) incantation (picking a non-conflicting delimiter), you can freely cut and paste the foreign string to and from Java. Being able to do this is not only convenient, but it reduces errors due to hand-mangling the string, and enhances readability because the embedded snippet is free of interference from Java.
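As a rough illustration of the delimiter rule in the proposal above, here is a minimal sketch of how a scanner might read one raw string literal. The class `RawStringScanner` and its `body` method are hypothetical names invented for this example; this is not the javac implementation, and it assumes that a run of backticks closes the literal only when it is exactly as long as the opening run.

```java
// Hypothetical sketch of the proposed scanning rule; not the javac lexer.
final class RawStringScanner {

    /**
     * Returns the body of the raw string literal that starts at index 0 of
     * {@code literal}. Text after the closing delimiter is ignored.
     */
    static String body(String literal) {
        int n = runLength(literal, 0);            // opening delimiter: N backticks
        if (n == 0) {
            throw new IllegalArgumentException("no opening delimiter");
        }
        StringBuilder body = new StringBuilder();
        int i = n;
        while (i < literal.length()) {
            char c = literal.charAt(i);
            if (c == '`') {
                int run = runLength(literal, i);
                if (run == n) {
                    return body.toString();       // a run of exactly N ticks closes the literal
                }
                body.append(literal, i, i + run); // any other run length is ordinary body text
                i += run;
            } else if (c == '\r') {
                body.append('\n');                // CR and CRLF are normalized to LF
                boolean crlf = i + 1 < literal.length() && literal.charAt(i + 1) == '\n';
                i += crlf ? 2 : 1;
            } else {
                body.append(c);                   // everything else is raw, including backslashes
                i++;
            }
        }
        throw new IllegalArgumentException("unterminated raw string literal");
    }

    /** Length of the run of consecutive backticks starting at {@code from}. */
    private static int runLength(String s, int from) {
        int i = from;
        while (i < s.length() && s.charAt(i) == '`') {
            i++;
        }
        return i - from;
    }

    public static void main(String[] args) {
        System.out.println(body("``a string with a single tick (`) in it``"));
        System.out.println(body("`Doesn't have a \\n newline character in it`"));
    }
}
```

Under this sketch, an input consisting of just two backticks is read as a two-tick opening delimiter with no closing delimiter, so it is rejected rather than yielding an empty string.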
Choosing raw-ness as a design center leads to a simpler design, which is good, but it is also _more stable_, because it leads us away from the temptation to tweak the rules here and there in ways that might be subjectively attractive, but that further increase the complexity of the feature.

This design choice reflects a priority choice: the high-order bit is _no embedding anomalies_. Users don't have to reason about whether they need to hand-mangle a snippet to avoid it being mangled by the compiler or runtime; given a suitable choice of delimiter, there's nothing else to think about. (IDEs can help with the "writing code" part of this.)

The various additional features we might be tempted to put in (special processing for leading or trailing blank lines, leading white space, trimming to markers, etc) can instead be handled via library functionality. Since raw string literals are Strings, we can further process them with library code -- both JDK code and user code (though methods on String have the advantage that they can be chained, rather than wrapped, which most users will prefer). Adding new string manipulation features via libraries rather than through the language is easier, can be done by users, and is not constrained by the demands of consistency (you can have seven different trimming methods, each with its own definition of whitespace, if you like), whereas a language feature has to be one-size-fits-all. Moving this complexity to the library where possible leads to a simpler feature and more choices for users.

#### A road not taken

We choose to divide the world of string literals first into raw and non-raw literals; from this, multi-line strings fall out for free, as we can treat line breaks in the source file as just more raw characters.
We could have chosen, instead, to first divide the world into single-line and multi-line strings, and then into raw and non-raw; this would have left us with four choices (raw single-line, raw multi-line, cooked single-line, cooked multi-line). This also would have been a defensible position, but seemed to add lexical complexity for little gain.

#### The exception that proves the rule

The one exception to raw-ness is that we normalize the line terminators to the most common (*nix) choice of a single newline, rather than using the platform-specific line terminator of the system that happens to have compiled the classfile. The alternative would have just been too surprising.

## Syntax

Given that this feature has such a high syntax-to-substance ratio, we should expect more than the usual number of syntax opinions. Let's start with some consequences of our chosen design center.

#### No fixed delimiter

From the design choice above, it is a forced move to accept variable delimiters. Otherwise, one cannot represent a string containing the delimiter in a raw string without inventing an escaping mechanism and subverting our "raw means raw" goal.

The "self-embedding test" is not a mere theoretical goal. Since the snippets we expect to paste into Java source are not randomly chosen strings of characters, but meaningful snippets of some language, the likelihood of wanting to represent a string that contains the chosen delimiter goes up. Even if you are willing to dismiss "embed Java in Java" as a serious use case (we're not), people also want a familiar delimiter, which means something that looks like the delimiter in other languages, further increasing the chance of collision. (For example, if we'd picked a fixed triple quote delimiter, then you couldn't embed Groovy or Python code, among others -- surely a real use case). Fixed delimiters (of any length) and "raw means raw" are not compatible goals, and we choose "raw means raw".
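To make the "pick a non-conflicting delimiter" step concrete, the following sketch computes the shortest usable tick delimiter for a given body. `RawDelimiters`, `minimalDelimiterLength`, and `quote` are hypothetical names for illustration only. The sketch assumes the rule that only a run of exactly N ticks conflicts with an N-tick delimiter, and it rejects bodies that the scheme cannot represent (empty, starting or ending with a tick, or containing a carriage return).

```java
import java.util.BitSet;

// Hypothetical helper for illustration; not part of any JDK API.
final class RawDelimiters {

    /** Smallest N > 0 such that the body contains no run of exactly N backticks. */
    static int minimalDelimiterLength(String body) {
        BitSet seenRunLengths = new BitSet();
        int i = 0;
        while (i < body.length()) {
            if (body.charAt(i) == '`') {
                int j = i;
                while (j < body.length() && body.charAt(j) == '`') {
                    j++;
                }
                seenRunLengths.set(j - i);   // record the length of this run of ticks
                i = j;
            } else {
                i++;
            }
        }
        return seenRunLengths.nextClearBit(1); // first run length that never occurs
    }

    /** Wraps the body in the shortest delimiter that does not collide with it. */
    static String quote(String body) {
        if (body.isEmpty()
                || body.charAt(0) == '`'
                || body.charAt(body.length() - 1) == '`'
                || body.indexOf('\r') >= 0) {
            // These bodies would merge with the delimiter or be altered by normalization.
            throw new IllegalArgumentException("body cannot be written as a raw string literal");
        }
        String delimiter = "`".repeat(minimalDelimiterLength(body)); // String.repeat needs Java 11+
        return delimiter + body + delimiter;
    }

    public static void main(String[] args) {
        System.out.println(quote("a string with two ticks (``) in it"));
    }
}
```

A body containing runs of two and three ticks gets a one-tick delimiter here, which matches the observation that the delimiter need not be longer than the longest tick sequence in the body.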
The credible options for variable delimiters are using a repeating delimiter sequence (say, any number of ticks), or some sort of user-provided nonce ("here" docs), or both. Nonces impose a higher cognitive load on readers, and their benefit accrues mostly to corner cases, so the more constrained option of repeating delimiters seems preferable.

#### Why not 'just' use triple quotes

People's syntax preferences are guided by familiarity, so we should expect suggestions to be biased towards what "similar" languages already do. So the suggestion of using """triple quotes""" should be expected.

We've already discussed how a fixed delimiter is not acceptable. So at a minimum, this would have to be adjusted to "three or more." While some people find triple quotes natural (or at least familiar), others find it offensively heavyweight. Neither crowd is going to convince the other.

#### But ticks are too light

The opposite of the "triple quotes are too heavy" argument is "ticks are too light": a single tick is a lightweight character, and could go unnoticed, especially if your monitor hasn't been cleaned for a while. Unfortunately, the quote-like delimiters in the middle of the weight range are already taken for other uses. Again, we can't satisfy the "too light" and "too heavy" crowds at the same time; whichever we do will make some people unhappy.

#### Why do you have to always do something new?

The quoting scheme chosen -- any number of ticks -- is actually taken from something we all use: Markdown (https://daringfireball.net/projects/markdown/syntax), which permits any number of ticks to be used for infix sequences, and any different number of ticks to be embedded. (Where we depart from Markdown is that Markdown strips any leading and trailing newlines from multi-line tick blocks, an appropriate trick for a page presentation language, but not consistent with the design goal of "raw".)
#### But I want indentation stripping

When embedding a snippet of one language in another, both of which support indentation, we are left with two choices: indent the enclosed block exactly, which has the effect of the code "jutting out to the left", or indent the enclosed block relative to the enclosing block, which has the effect of having more indentation than you might want for the enclosed block. Sometimes this doesn't matter, but sometimes it does. Whatever we do, one of these crowds will be unhappy. When in doubt, we stick to the principle of "raw means raw", and provide indentation stripping via new instance methods on `String` to allow a range of trimming options, such as `trimIndent()`.

#### But I want leading / trailing empty lines

Some people would like for the language to strip off leading and trailing blank lines. Like indentation stripping, this is going to be what people want sometimes, and sometimes not. Given that, again, we can't do both, we are again guided by "raw means raw", and provide library means to strip the extraneous newlines.

#### But I want a marker character to make it obvious

Some people would like a margin marker character, so they can manage margins like this:

    foo(`This is a long string
         >the characters up to, and
         >including, the bracket are stripped
         >by the compiler
         >  and this line is indented`)

(Others would argue the marker character should be "|".) Again, we believe these sorts of transforms are the purview of libraries, not language, and will be provided.

#### But people will make ASCII art

    ``````````````````
    `Yes, they might.`
    ``````````````````

#### But I want to use Unicode escaping

There will be library support for explicitly processing Unicode escape sequences, or backslash escape sequences, or both.

#### But calling library methods like `longString`.trim() is ugly

You say ugly; I say simple and transparent.

#### But doing these things in libraries has to be slower and yield more bloated bytecode

No, it doesn't.
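As one illustration of handling such transforms in a library rather than in the language, here is a minimal sketch of margin-marker stripping as an ordinary static method. `Margins.stripMargin` is a hypothetical helper invented for this example, not one of the methods proposed for `String`, and its exact behavior (lines without a marker are left untouched, only spaces and tabs count as leading whitespace) is an assumption. The input is built from a traditional string literal so that the example compiles without relying on the proposed tick syntax.

```java
// Hypothetical library-side helper; the name and exact behavior are assumptions.
final class Margins {

    /**
     * For each line, removes leading spaces and tabs up to and including the
     * first marker character. Lines that do not start with the marker after
     * optional leading whitespace are left unchanged.
     */
    static String stripMargin(String text, char marker) {
        String[] lines = text.split("\n", -1);   // limit -1 keeps trailing empty lines
        StringBuilder out = new StringBuilder();
        for (int k = 0; k < lines.length; k++) {
            String line = lines[k];
            int i = 0;
            while (i < line.length() && (line.charAt(i) == ' ' || line.charAt(i) == '\t')) {
                i++;
            }
            if (i < line.length() && line.charAt(i) == marker) {
                line = line.substring(i + 1);    // drop the margin and the marker itself
            }
            if (k > 0) {
                out.append('\n');
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "This is a long string\n"
                   + "     >the characters up to, and\n"
                   + "     >including, the bracket are stripped\n"
                   + "     >  and this line is indented";
        System.out.println(stripMargin(raw, '>'));
    }
}
```

Because the helper is plain library code, a variant using "|" as the marker, or a different definition of whitespace, is a one-line change for whoever wants it.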
## Anomalies and puzzlers

While the proposed scheme is lexically very simple, it does have at least one surprising consequence, as well as at least one restriction:

 - The empty string cannot be represented by a raw string literal (because two consecutive ticks will be interpreted as a double-tick delimiter, not a starting and ending delimiter);
 - Strings containing line delimiters other than \n cannot be represented directly by a raw string literal.

The latter anomaly is true for any scheme that is free of embedding anomalies (escaping) and that normalizes newlines. If we chose to not normalize newlines, we'd arguably have a worse anomaly, which is that the carriage control of a raw string depends on the platform you compiled it on.

The empty-string anomaly is scary at first, but, in my opinion, is much less of a concern than the initial surprise makes it appear. Once you learn it, you won't forget it -- and IDEs and compilers will provide feedback that helps you learn it. It is also easily avoided: use traditional string literals unless you have a specific need for raw-ness. There already is a perfectly valid way to denote the empty string.

#### Can't these be fixed?

These anomalies can be moved around by tweaking the rules, but the result is going to be more complicated rules and the same number (or more) of anomalies, just in different places -- and sometimes in worse places. While there is room to subjectively differ on which anomalies are worse than others, we believe that the simplicity of this scheme, and its freedom from embedding anomalies, makes it the winner. Because we start with such a simple rule (any number of consecutive ticks), pretty much any tweak is going to be complexity-increasing. It seems a poor tradeoff to make the feature more complex and less convenient for everyone, just to cater to empty strings.