The first algorithm I ever worked on in the C# compiler was the optimizer that handles string concatenations. (Unfortunately I did not manage to port these optimizations to the Roslyn codebase before I left; hopefully someone will get to that!) It’s all pretty straightforward, but there are a few clever bits that I thought I might discuss today.

First off, let’s consider just straight up concatenation of string variables; in fact, let’s assume that all variables are strings for the purposes of this article. If you have:

r = a + b;

then clearly that is going to be generated as

r = String.Concat(a, b);

But what about

r = a + b + c;

? String addition, like all addition in C#, is left associative, so this is the same as

r = (a + b) + c;

So you might think that would be generated as

r = String.Concat(String.Concat(a, b), c);

but the compiler can be more efficient than that because there is an overload of String.Concat which takes three operands and produces the same result:

r = String.Concat(a, b, c);

Ditto for four operands. However, when we get to five or more operands, the compiler switches to a different overload:

r = a + b + c + d + e;

is generated as

r = String.Concat(new String[5] { a, b, c, d, e });

It’s a bit unfortunate that you have to take on the cost of allocating an array, but there it is.
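
As a quick sanity check on the claim that these overloads all produce the same result, here is a minimal sketch that calls each one by hand. The class and method names (ConcatOverloads, Nested, Flat, ViaArray) are made up for illustration; only String.Concat itself is real, and the sketch does not show which overload any particular compiler version picks for the + operator.

```csharp
using System;

static class ConcatOverloads
{
    // What the programmer writes.
    public static string ViaOperator(string a, string b, string c) => a + b + c;

    // The naive translation: two calls and one intermediate string.
    public static string Nested(string a, string b, string c) =>
        String.Concat(String.Concat(a, b), c);

    // The three-operand overload: one call, no intermediate string.
    public static string Flat(string a, string b, string c) =>
        String.Concat(a, b, c);

    // Five operands: the overload that takes an array.
    public static string ViaArray(string a, string b, string c, string d, string e) =>
        String.Concat(new String[5] { a, b, c, d, e });
}
```

All four methods build the same characters in the same order; the difference is only in how many calls and temporary objects are involved.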

That’s all very straightforward, but there are more optimizations that the C# compiler performs. The main optimization depends upon the fact that string concatenation is fully associative. (And that String.Concat has no observable side effects.) That is, the parentheses don’t matter. You might think that an oddity like

r = a + (b + c);

would have to be generated as

r = String.Concat(a, String.Concat(b, c));

but of course we can observe that since string concatenation is associative, this is no different than

r = (a + b) + c;

which we already know how to generate efficiently. The C# compiler’s string optimizer automatically rewrites nested string concatenations into the left-parenthesized form no matter how you parenthesize them. Except when it doesn’t, which is our next optimization. Let’s now consider the impact of constants:

r = "A" + "B" + c;

The compiler will parse this as:

r = ("A" + "B") + c;

and of course that first term is a compile-time constant according to the spec. The compiler automatically computes values of compile-time constants, so this is the same as:

r = String.Concat("AB", c);

But what about this?

r = a + "B" + "C";

There are two compile-time constants, but this is parsed as

r = (a + "B") + "C";

Since both addition operators have a non-constant element on their left side, you might think this would have to be generated as

r = String.Concat(a, "B", "C");

But again associativity comes to the rescue. The optimizer notices that the original expression is the same as

r = a + ("B" + "C");

and generates

r = String.Concat(a, "BC");

And it will even do the right consolidation of constants “in the middle”:

r = a + "B" + "C" + d;

is generated as though you’d said:

r = (a + ("B" + "C")) + d;

to produce

r = String.Concat(a, "BC", d);
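
The two ideas so far, ignoring parentheses and merging neighboring constants, can be sketched on a toy expression tree. This is only an illustration under stated assumptions: the types Expr, Variable, Constant and ConcatNode and the ConcatSketch class are hypothetical, and the real compiler's data structures and algorithm are certainly different.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical toy expression tree; not the compiler's actual representation.
abstract class Expr { }

sealed class Variable : Expr
{
    public readonly string Name;
    public Variable(string name) { Name = name; }
}

sealed class Constant : Expr
{
    public readonly string Value;
    public Constant(string value) { Value = value; }
}

sealed class ConcatNode : Expr
{
    public readonly Expr Left;
    public readonly Expr Right;
    public ConcatNode(Expr left, Expr right) { Left = left; Right = right; }
}

static class ConcatSketch
{
    // Returns the operands in left-to-right order, with adjacent
    // constants folded into a single constant.
    public static List<Expr> Operands(Expr tree)
    {
        var operands = new List<Expr>();
        Collect(tree, operands);
        return operands;
    }

    static void Collect(Expr e, List<Expr> operands)
    {
        if (e is ConcatNode node)
        {
            // Associativity: the nesting is irrelevant, only the order matters.
            Collect(node.Left, operands);
            Collect(node.Right, operands);
            return;
        }

        int last = operands.Count - 1;
        if (e is Constant c && last >= 0 && operands[last] is Constant prev)
        {
            // Two constants side by side: fold them at "compile time".
            operands[last] = new Constant(prev.Value + c.Value);
        }
        else
        {
            operands.Add(e);
        }
    }

    // Renders the operand list as the call a compiler might emit.
    public static string Render(Expr tree)
    {
        var parts = new List<string>();
        foreach (Expr e in Operands(tree))
        {
            parts.Add(e is Constant k ? "\"" + k.Value + "\"" : ((Variable)e).Name);
        }
        return "String.Concat(" + string.Join(", ", parts) + ")";
    }
}
```

Feeding it the left-nested tree for a + "B" + "C" + d yields three operands, with "B" and "C" merged into "BC" in the middle. The sketch does nothing special with empty or null constants.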

The C# compiler also knows that empty string constants are the identity elements of concatenation; they are eliminated. If you say

r = a + "" + b;

then the code generated is not

r = String.Concat(a, "", b);

but rather simply

r = String.Concat(a, b);

Null string constants are likewise identity elements:

const string NullString = null;
r = a + NullString + b;

is similarly generated as

r = String.Concat(a, b);

OK, I must admit I just told a bald-faced lie there; did you catch it?

The null and empty strings are not identities of concatenation. An identity of an operator has the property that it leaves the other operand unchanged; zero is the identity of addition of numbers because any number plus zero gives you the original number. But there is one possible operand that is not preserved by addition with an empty or null string: the null string! NullString + NullString and NullString + "" both result in the empty string, not the null string, so neither is an identity.
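
The failure of the identity property is easy to observe at runtime. In this small sketch the NullConcat class and its Plus method are hypothetical helpers; the operands are passed as parameters so that nothing is folded at compile time.

```csharp
using System;

static class NullConcat
{
    // Plain string addition on two runtime values.
    public static string Plus(string x, string y) => x + y;

    // True when adding `candidate` on the right hands back the same value,
    // where null and "" are treated as different values.
    public static bool PreservesLeft(string left, string candidate) =>
        Plus(left, candidate) == left;
}
```

PreservesLeft("abc", "") and PreservesLeft("abc", null) are both true, but PreservesLeft(null, "") and PreservesLeft(null, null) are both false, because the sum is the empty string rather than null.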

You might think then that

r = a + NullString;

must be generated as

r = String.Concat(a, NullString);

because the compiler needs to ensure that if a is null then the result is the empty string. In fact the C# compiler anticipates what String.Concat will do and simply inlines it. This is generated the same as:

r = a ?? "";

which I think is a neat trick, to elide the concatenation entirely.
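
To convince yourself that the two forms are interchangeable, here is a minimal sketch comparing them. The ElidedConcat class and its method names are made up for illustration, and the explicit String.Concat call stands in for what the unoptimized code would do.

```csharp
using System;

static class ElidedConcat
{
    // What the unoptimized code would call. The cast picks the
    // (string, string) overload unambiguously.
    public static string ViaConcat(string a) => String.Concat(a, (string)null);

    // The inlined form: no call at all.
    public static string ViaCoalesce(string a) => a ?? "";
}
```

Both methods return the empty string when a is null and a string equal to a otherwise.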

Next time on FAIC: An optimization the C# compiler does not perform.

