CharacterView inherits the default IndexingIterator from Collection, so each call to next does the following:

1. Uses subscript to get the current character, which constructs a new String from a slice of the buffer and then a Character from that (source).
2. Advances its index to the next character, which is a relatively complex algorithm in order to achieve Unicode correctness (source).

This is still a lot of overhead—and pulling the characters being matched into separate variables is a pretty big sacrifice to code readability that we shouldn’t have to make. The set of characters we need to recognize in this example is fairly small, but a tokenizer for a more complex language could have many more. Can we do better?

Do you need characters at all? If not, use something else.

The power of Character is that it treats a sequence of code points as a single “human-perceived character” and cleanly handles canonical equivalence, but it is not the representation in which a string’s elements are actually stored.

Swift String values contain an instance of the _StringCore type, which is optimized to store either ASCII or UTF-16 encoded text. When Swift gives you a CharacterView, it needs to transform that underlying representation into Character values. Even though it does this lazily as you traverse the collection, if you iterate over the entire string, CharacterView is doing a lot of memory allocations and complex calculations to give you those values.

So, if your use case doesn’t need the advanced capabilities that Character provides (our scanner doesn’t), then you’ll get much better performance by using something closer to the string’s internal representation. We still have three string views to consider: UTF8View, UTF16View, and UnicodeScalarView.
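All three are exposed as properties on String:

```swift
let s = "föö"
let utf8 = s.utf8               // String.UTF8View: UTF-8 code units (UInt8)
let utf16 = s.utf16             // String.UTF16View: UTF-16 code units (UInt16)
let scalars = s.unicodeScalars  // String.UnicodeScalarView: UnicodeScalar values
```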

Digging into string internals

Let’s start by taking a closer look at _StringCore. First, notice the leading underscore: this is one of those types that was made public as an implementation detail, but which we shouldn’t touch in our own code. The Swift team can, and likely will, change it in ways that would break us if we accessed it directly. That said, it’s beneficial to understand how it works and how it affects performance. We won’t use _StringCore directly in any code that we write, but we’ll explore its implementation so that we can make informed decisions about how to make our scanner more efficient.

As I mentioned above, _StringCore is optimized to handle both ASCII and UTF-16 encoded text; it does so by using some clever bit-twiddling and arithmetic tricks. An instance variable stores a pointer to the underlying bytes that make up the string content. Another instance variable keeps track of the count of ASCII or UTF-16 code units in the string, but the most significant bit of this count is special. _StringCore calls it elementShift: if it equals 0, the buffer contains ASCII data; if it equals 1, the buffer contains UTF-16 data. In other words, adding 1 to the value of this bit gives us the number of bytes that we need to advance to get from one code unit to the next: 1 for ASCII and 2 for UTF-16. _StringCore also ensures that internal consistency is maintained during other string operations; for example, if you append a string with UTF-16 data to one with ASCII data, the ASCII data will be widened to UTF-16.
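To make the trick concrete, here is a simplified sketch of that layout in Swift. TinyStringCore is an invented name, and this model omits the other flags and cases that the real type handles:

```swift
// Illustrative only: a simplified model of the elementShift trick,
// not the actual _StringCore implementation.
struct TinyStringCore {
    var baseAddress: UnsafeRawPointer  // start of the code unit buffer
    var countAndFlags: UInt            // MSB is elementShift; the rest is the count

    // 0 means the buffer holds ASCII; 1 means it holds UTF-16.
    var elementShift: Int {
        return Int(countAndFlags >> (UInt.bitWidth - 1))
    }

    // Bytes per code unit: 1 for ASCII, 2 for UTF-16.
    var elementWidth: Int {
        return elementShift + 1
    }

    var count: Int {
        return Int(countAndFlags & (UInt.max >> 1))
    }

    // Address of the i-th code unit: base + i * elementWidth,
    // computed with a shift instead of a multiply.
    func address(ofElementAt i: Int) -> UnsafeRawPointer {
        return baseAddress + (i << elementShift)
    }
}
```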

Since _StringCore can store both kinds of text, how does it know which one to use for a given string? The compiler examines each string literal and detects whether it contains only 7-bit ASCII (code units in the range 0 to 127) or whether it requires Unicode, and it uses different initializers to create the String values. Let’s consider these two strings:

```swift
let s1 = "foo"
let s2 = "föö"
```

The first can be represented in 7-bit ASCII, while the second cannot. If we compile this and examine the SIL, we see that slightly different code is generated:

```
%13 = string_literal utf8 "foo"
%14 = integer_literal $Builtin.Word, 3
%15 = integer_literal $Builtin.Int1, -1
%16 = metatype $@thin String.Type
%17 = function_ref @Swift.String.init(
        _builtinStringLiteral: Builtin.RawPointer,
        utf8CodeUnitCount: Builtin.Word,
        isASCII: Builtin.Int1) -> Swift.String :
      $@convention(method) (Builtin.RawPointer, Builtin.Word,
        Builtin.Int1, @thin String.Type) -> @owned String
%18 = apply %17(%13, %14, %15, %16) :
      $@convention(method) (Builtin.RawPointer, Builtin.Word,
        Builtin.Int1, @thin String.Type) -> @owned String

%22 = string_literal utf16 "föö"
%23 = integer_literal $Builtin.Word, 3
%24 = integer_literal $Builtin.Int1, 0
%25 = metatype $@thin String.Type
%26 = function_ref @Swift.String.init(
        _builtinUTF16StringLiteral: Builtin.RawPointer,
        utf16CodeUnitCount: Builtin.Word) -> Swift.String :
      $@convention(method) (Builtin.RawPointer, Builtin.Word,
        @thin String.Type) -> @owned String
%27 = apply %26(%22, %23, %25) :
      $@convention(method) (Builtin.RawPointer, Builtin.Word,
        @thin String.Type) -> @owned String
```

The String initializers in %17 and %26 call _StringCore.init with the appropriate value for elementShift to indicate whether the string is ASCII or UTF-16 (source).

Armed with this knowledge, we might guess that either the UTF8View or the UTF16View would be quite fast, depending on what kind of string we have. That last part is tricky, though: there’s no fast way to determine whether we have an ASCII string or a UTF-16 one. It’s an implementation detail of _StringCore and not exposed by the public API (nor should it be). Even if we did know, how would we use that information? Would we write two separate scanners—one for ASCII text and one for UTF-16? That’s not a realistic approach, either.

Instead of jumping right into the details of how UTF8View and UTF16View are implemented, first let’s consider the four possible cases and use our intuition to hypothesize about how efficient they might be:

1. A String with ASCII data accessed through UTF8View: Since ASCII is a subset of UTF-8, the data matches the way we’re viewing it, so this should be fast.
2. A String with UTF-16 data accessed through UTF8View: The UTF-16 data has to be transcoded to UTF-8 and the iterator must maintain state about where it currently sits within a character’s UTF-8 code units, so the computational overhead would make this somewhat slower.
3. A String with ASCII data accessed through UTF16View: ASCII characters can be cheaply widened to 16-bit integers that are valid UTF-16 code units, so this should be fairly fast.
4. A String with UTF-16 data accessed through UTF16View: The data matches the way we’re viewing it, so this should be fast.

Since there’s a case where we believe UTF8View will perform poorly, we’ll set it aside. On the other hand, UTF16View looks promising, so let’s try converting our tokenizer to one that uses it. We can start by switching from String.characters to String.utf16 and updating uses of Character to UTF16.CodeUnit. A snippet is shown below:
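This sketch is illustrative rather than the tokenizer’s exact code: it assumes a tokenizer shaped like the CharacterBasedTokenizer, with a minimal Token type and an advance() helper standing in for the real implementation.

```swift
enum Token {
    case comma
    case semicolon
    // ...
}

let source = "foo,bar;baz"
var iterator = source.utf16.makeIterator()
var current: UTF16.CodeUnit? = iterator.next()

func advance() {
    current = iterator.next()
}

func nextToken() -> Token? {
    guard let codeUnit = current else { return nil }
    switch codeUnit {
    case ",":  // matching a UTF16.CodeUnit against a string literal
        advance()
        return .comma
    case ";":
        advance()
        return .semicolon
    default:
        return nil
    }
}
```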

There’s just one problem—this won’t compile yet. There are two big issues:

1. Unlike with Character, Swift does not let single-character string literals be used as UTF-16 code units. We would have to replace the literals in our case patterns with their raw numeric UTF-16 values, which isn’t readable or easily maintainable.
2. In the integerToken and stringToken methods, we collect the token text by appending each character to a new string. There is no overload of String.append that can take a UTF-16 code unit, UTF16View is not mutable, and converting a single UTF-16 code unit to a String that could be appended is not always feasible (for example, if it is part of a surrogate pair). We would have to use an alternate approach, like collecting the UTF-16 code units in an array and then decoding that into a String after the fact, as sketched below.
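To give a sense of how low-level that workaround gets, here is one possible shape of it; it leans on Foundation’s String(utf16CodeUnits:count:), and the codeUnits array stands in for units collected one at a time during scanning:

```swift
import Foundation

// 0xD83D/0xDE00 is the surrogate pair for 😀; neither half can be
// turned into a String on its own.
let codeUnits: [UTF16.CodeUnit] = [0x22, 0xD83D, 0xDE00, 0x22]  // "😀" with quotes

// Decode the accumulated code units into a String after the fact.
let text = codeUnits.withUnsafeBufferPointer { buffer in
    String(utf16CodeUnits: buffer.baseAddress!, count: buffer.count)
}
```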

In other words, to use UTF16View, the code we would have to write would be fairly low-level compared to what we originally had. Swift is not a low-level language and we shouldn’t have to make that sacrifice. Let’s reject this approach.

Help me, UnicodeScalarView—you’re my only hope.

With all this in mind, the only option left is UnicodeScalarView. Either I’ve saved the best for last, or this article is going to have a very disappointing ending. Let’s look at what UnicodeScalarView and its element type UnicodeScalar are capable of.

First, UnicodeScalar conforms to ExpressibleByUnicodeScalarLiteral, so the compiler will let us create a UnicodeScalar using a string literal that contains a single scalar. In other words, we can still write case "," and have our code be readable.
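For example, both the binding and the case patterns below compile, because each literal is converted to a UnicodeScalar at compile time:

```swift
let scalar: UnicodeScalar = ","

switch scalar {
case ",":
    print("comma")      // string literal patterns still work
case ";":
    print("semicolon")
default:
    print("something else")
}
```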

In addition to the readability benefit, initializing a UnicodeScalar from a string literal is fast. UnicodeScalar is a wrapper around a 32-bit unsigned integer, and there are no string buffer allocations involved like we saw with Character. Let’s look at the generated SIL for a simple assignment:

```
// let scalar: UnicodeScalar = "ö"
%12 = global_addr @StringTest.scalar : Swift.UnicodeScalar : $*UnicodeScalar
%13 = integer_literal $Builtin.Int32, 246
%14 = struct $UInt32 (%13 : $Builtin.Int32)
%15 = struct $UnicodeScalar (%14 : $UInt32)
store %15 to %12 : $*UnicodeScalar
```

The compiler has already extracted the scalar from the string (246 is the decimal Unicode code point for “Latin Small Letter O with Diaeresis”), and the remaining instructions are simple value type conversions. The resulting machine code will be nothing more than an instruction that loads the number 246 into a register or memory address.

Next, we can’t append a UnicodeScalar directly to a String, but unlike UTF8View and UTF16View, UnicodeScalarView is mutable and provides an append method that takes a UnicodeScalar. This lets the integerToken and stringToken methods work as they did before with only minor modifications.
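A minimal sketch of that pattern; text stands in for the string a token method would accumulate:

```swift
var text = ""
let quote: UnicodeScalar = "\""
let o: UnicodeScalar = "ö"

// UnicodeScalarView is mutable, so we can append scalars directly
// to the string we're building.
text.unicodeScalars.append(quote)
text.unicodeScalars.append(o)
text.unicodeScalars.append(quote)
// text is now "ö" wrapped in double quotes
```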

Finally, iterating over the UnicodeScalarView is very fast as well. The iterator delegates most of its behavior to the UTF16 codec type, which decodes the underlying ASCII or UTF-16 code units in the string buffer and produces UnicodeScalar values. The decode method can be seen here and is fairly straightforward. The vast majority (96.875%) of UTF-16 code units are numerically equivalent to their Unicode scalar (those in the range 0x0000–0xD7FF or 0xE000–0xFFFF). The remaining ones are the surrogate code units, and well-formed UTF-16 surrogate pairs consist of one high surrogate followed immediately by one low surrogate. The decoder never has to look more than one position ahead to convert any UTF-16 code unit or surrogate pair into a UnicodeScalar, and the conversion is simple bitwise arithmetic that does not require any additional memory to be allocated.
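That arithmetic is simple enough to sketch. The function below is an illustrative re-implementation of the idea, assuming well-formed input; it is not the standard library’s actual decode method:

```swift
// Decode one UnicodeScalar from a UTF-16 sequence. `next()` supplies
// the following code unit when we encounter a high surrogate.
func decodeOne(_ unit: UInt16, next: () -> UInt16) -> UnicodeScalar {
    // The 96.875% case: the code unit is the scalar value.
    if unit < 0xD800 || unit > 0xDFFF {
        return UnicodeScalar(unit)!
    }
    // Surrogate pair: combine 10 bits from each half.
    let high = UInt32(unit) - 0xD800
    let low = UInt32(next()) - 0xDC00
    return UnicodeScalar(0x10000 + (high << 10) + low)!
}

let smiley = decodeOne(0xD83D) { 0xDE00 }
print(smiley)  // 😀 (U+1F600)
```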

A downside is that if we switch to UnicodeScalar, each element we process no longer necessarily corresponds to a single human-recognizable character, and we lose the ability to easily check for canonical equivalence. Fortunately, this does not affect our tokenizer. The basic tokens we recognize, like comma and semicolon, are simple ASCII characters. The only location where Unicode characters might appear is inside double-quoted strings, and those scalars will simply be appended to another string that we get back in the returned token.

All of this seems promising, so let’s test it out by creating a UnicodeScalarBasedTokenizer. The conversion from CharacterBasedTokenizer is very simple (view source). Running the benchmark:

```
UnicodeScalarBasedTokenizer:
..... 108.4403786 ms ± 2.50400540742841 ms (mean ± SD)
```

That’s a significant improvement: 50 times faster than the version that used CharacterView, with only minor changes.

Summary

Swift’s String type is a powerful abstraction that provides access to different encodings through a set of “views.” In particular, CharacterView and the Character type provide a clean and convenient API that solves a long-standing problem with text processing—identifying clusters of code points that form a single human-recognizable character and determining canonical equality without having to roll your own algorithms to handle low-level Unicode details.

This power comes with a computational cost, however. As they are currently implemented, Character values involve time-consuming heap allocations that aren’t immediately obvious to callers—especially for Character literals. Although these allocations are often short-lived, the recurring allocate-and-free churn can significantly drag down the performance of string processing in tight loops.

Since String.characters is the “obvious” way to access a string’s elements, it’s easy to write code that performs sub-optimally even when you don’t need to take advantage of those features.