In my work, we often deal with codes: country codes, airport codes, airline codes, aircraft codes, and more. What they have in common is that they are really short: sometimes 2 letters long, sometimes 3, sometimes a bit longer. Most of the time their value is compared against other codes, e.g., when searching through a collection. We hardly ever need to see their textual representation.

The initial choice was to represent them in programs with type std::string, a general-purpose type for storing sequences of characters. But, as with any general-purpose tool, it fails to take into account particular use cases, which often offer room for improvement.

Many implementations of string offer a Short String Optimization (SSO): for sufficiently short text, memory allocation is avoided altogether, and the characters are stored in the place where you would otherwise have pointers to the allocated memory. This is very useful and avoids expensive memory allocation, but it is still far from the optimum solution when we know we will always be storing sequences of length, say, 2.

Typically, the size of a string comprises three pointers: the beginning of the allocated array, the end of the allocated array, and the end of the currently stored sequence. On my platform sizeof(void*) is 8. This means that the size of the string will be no less than 24 (chars). We know we can store any 2-letter code in type char[2], which is 12 times smaller. An extreme (and inefficient) implementation of string could have the size of one pointer (pointing to the structure of three pointers described above), but even then we would have a size of 8, whereas technically we only need 2.

The size of the data type affects run-time performance. Suppose you need to check if a given value exists in an array of values. The question is how many consecutive objects from the array you can fit in one CPU cache line. In other words, how many of the objects can you inspect without accessing main memory? Suppose my CPU’s cache line is able to store 64 chars. This means I can fit 2 full strings with SSO. Compare this with 32 small codes that can each be encoded in 2 characters.
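The cache-line arithmetic above can be sketched as follows. The 64-byte line is the same assumption as in the text; the names cache_line_size, two_letter_code and objects_per_cache_line are made up for this illustration, and sizeof(std::string) differs between standard library implementations.

```cpp
#include <cstddef>
#include <string>

// Assumed cache line size in bytes; the real value depends on the CPU.
const std::size_t cache_line_size = 64;

// Hypothetical stand-in for a 2-letter code: just two characters.
struct two_letter_code
{
    char array_[2];
};

// Number of whole objects of type T that fit in one (assumed) cache line.
template <typename T>
std::size_t objects_per_cache_line()
{
    return cache_line_size / sizeof(T);
}
```

With these assumptions, objects_per_cache_line<two_letter_code>() yields 32, while objects_per_cache_line<std::string>() yields 2 on implementations where sizeof(std::string) is 24 or 32.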

Another cause of slow-down is the run-time check required by SSO. The decision whether to allocate memory or store characters locally is made dynamically, and cannot be deduced from the type alone. This decision is recorded as a flag inside the string implementation. Later, when we want to read the value, we first need to read the flag and then branch on it to decide from what address to read the characters. This branch comes with a cost.

Designing class code

The following is an attempt at the design of a type that would take advantage of the information that we are only storing character strings of length N, where N is sufficiently small (say, no longer than 8).

Since N is known statically, at compile-time, we will make it part of the type. There is no need to check the length of the string at run-time. For storage, we can use a raw array. So, the first approximation of the implementation could be the following:

    template <int N>
    class code
    {
      char array_[N];

    public:
      // the interface
    };

We leave no room for the trailing null character. In C-strings it is used to indicate the end of the sequence, but in our case we know exactly where the sequence ends: after the N-th character. There is no need to duplicate this information. Therefore we can save one character, which can prove quite a huge saving for a 2-letter code.

Relational ops

Most of the time, in our assumed usage scenario, the codes will be compared for order and equality, so let’s implement these first. The lexicographical ordering of strings can be implemented with memcmp:

    template <int N>
    class code
    {
      // ...
      friend bool operator<(code l, code r)
      {
        return std::memcmp(l.array_, r.array_, N) < 0;
      }
    };

You might be uncomfortable about passing the arguments by value. But remember we are passing a trivially copyable object of about the size of an int.
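To try the ordering in isolation, here is a minimal sketch. It is a simplified stand-in for the class under construction: the constructor taking a character pointer is a hypothetical helper added only so that values can be created, and it is not part of the design described in this post.

```cpp
#include <cstring>

// Simplified stand-in for the class discussed in the text.
template <int N>
class code
{
    char array_[N];

public:
    // Hypothetical helper constructor.
    // Assumed precondition: s points to at least N characters.
    explicit code(const char* s) { std::memcpy(array_, s, N); }

    // Lexicographical comparison of the N stored characters.
    friend bool operator<(code l, code r)
    {
        return std::memcmp(l.array_, r.array_, N) < 0;
    }
};
```

With this sketch, code<3>("KRK") < code<3>("WAW") holds, and on mainstream platforms sizeof(code<2>) is 2, with no room spent on a terminating null character.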

The equality comparison could be implemented in a similar manner, but there is a faster way. We can cast the two arrays to integer types and compare the integers: it should be one processor instruction.

Rather than using a reinterpret_cast, we will use a union. They are comparable and equally type-unsafe. The implementation of our storage will look like this:

    template <int N>
    class code
    {
      typedef typename shortest_fitting_int<N>::type int_t;

      union { // anonymous union
        char array_[N];
        int_t as_int_;
      };

    public:
      // the interface
    };

We use an anonymous union. Names array_ and as_int_ can be thought of as two views through which we can observe the storage. Type int_t is the smallest built-in integer type capable of holding our array. It is defined in terms of a compile-time meta-function which selects the appropriate type given the number N. You may find this syntax obscure. In C++11 (with alias templates) this could be changed into:

    template <int N>
    class code
    {
      typedef shortest_fitting_int_t<N> int_t;

      union {
        char array_[N];
        int_t as_int_;
      };

    public:
      // the interface
    };

This is a bit shorter, but it now requires the reader to know the C++11 alias templates, and for some this may turn out to be even more confusing. Besides, I want to show you that this library for processing short codes is implementable in C++03. No new features are required.

Now, how do we define meta-function shortest_fitting_int ?

Conceptually, it would “return” the desired integral type as follows:

    // pseudo-language; not C++
    if (N <= sizeof(int8_t))
      return int8_t;
    else if (N <= sizeof(int16_t))
      return int16_t;
    else if (N <= sizeof(int32_t))
      return int32_t;
    else
      return int64_t;

In C++, we have to write a big nested meta-function. If it scares you, I can only say, you are not alone. Consider it a price you have to pay for gaining additional performance.

Instead of ‘returning’ we will be defining a nested type named type. Instead of an if-statement, we will use another meta-function, boost::conditional.

    template <int N>
    struct shortest_fitting_int
    {
      BOOST_STATIC_ASSERT(N > 0);
      BOOST_STATIC_ASSERT(N <= sizeof(int64_t));

      typedef typename boost::conditional<
        (N <= sizeof(int8_t)), int8_t,
        typename boost::conditional<
          (N <= sizeof(int16_t)), int16_t,
          typename boost::conditional<
            (N <= sizeof(int32_t)), int32_t,
            int64_t
          >::type
        >::type
      >::type type;
    };

Again, with C++11 static assertions and the standard library’s built-in meta-functions (the std::conditional_t alias template used here arrived in C++14), this could be simplified a bit:

    template <int N>
    struct shortest_fitting_int
    {
      static_assert(N > 0, "N must be positive");
      static_assert(N <= sizeof(int64_t), "N > 8");

      typedef std::conditional_t<
        (N <= sizeof(int8_t)), int8_t,
        std::conditional_t<
          (N <= sizeof(int16_t)), int16_t,
          std::conditional_t<
            (N <= sizeof(int32_t)), int32_t,
            int64_t
          >
        >
      > type;
    };
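To see which type the meta-function selects for each N, here is a self-contained sketch that assumes C++11 and uses std::conditional from <type_traits> in place of boost::conditional. The int(...) conversions are an addition that keeps the comparisons between same-signed operands.

```cpp
#include <cstdint>
#include <type_traits>

template <int N>
struct shortest_fitting_int
{
    static_assert(N > 0, "N must be positive");
    static_assert(N <= int(sizeof(std::int64_t)), "N must not exceed 8");

    typedef typename std::conditional<
        (N <= int(sizeof(std::int8_t))), std::int8_t,
        typename std::conditional<
            (N <= int(sizeof(std::int16_t))), std::int16_t,
            typename std::conditional<
                (N <= int(sizeof(std::int32_t))), std::int32_t,
                std::int64_t
            >::type
        >::type
    >::type type;
};

// Compile-time checks of the selected types.
static_assert(std::is_same<shortest_fitting_int<1>::type, std::int8_t>::value, "N == 1");
static_assert(std::is_same<shortest_fitting_int<2>::type, std::int16_t>::value, "N == 2");
static_assert(std::is_same<shortest_fitting_int<3>::type, std::int32_t>::value, "N == 3");
static_assert(std::is_same<shortest_fitting_int<4>::type, std::int32_t>::value, "N == 4");
static_assert(std::is_same<shortest_fitting_int<5>::type, std::int64_t>::value, "N == 5");
static_assert(std::is_same<shortest_fitting_int<8>::type, std::int64_t>::value, "N == 8");
```

One consequence worth keeping in mind: a union of char[3] and a 32-bit integer occupies 4 bytes, so a 3-letter code stored this way takes 4 bytes rather than 3, and a 5-letter code takes 8.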

That was the most difficult part of this post. Now, the implementation of operator== is trivial:

    template <int N>
    class code
    {
      typedef typename shortest_fitting_int<N>::type int_t;

      union {
        char array_[N];
        int_t as_int_;
      };

    public:
      friend bool operator==(code l, code r)
      {
        return l.as_int_ == r.as_int_;
      }

      // other interface
    };

So, why not implement operator< similarly, with int comparison? Given the endianness issues, the resulting order might not sort the codes alphabetically. It might be acceptable for some applications, but I wanted the behavior to be more like that of a string.
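The endianness concern can be made concrete with a small sketch. The helpers as_int16 and is_little_endian are hypothetical names introduced for this illustration, and std::memcpy is used here to view the characters as an integer.

```cpp
#include <cstring>
#include <stdint.h>

// Hypothetical helper: view two characters as one 16-bit integer.
inline uint16_t as_int16(const char* two_chars)
{
    uint16_t result = 0;
    std::memcpy(&result, two_chars, sizeof(result));
    return result;
}

// Hypothetical helper: true if the least significant byte is stored first.
inline bool is_little_endian()
{
    const uint16_t one = 1;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &one, 1);
    return first_byte == 1;
}
```

std::memcmp("AB", "BA", 2) is negative on every platform, so "AB" sorts first. The integer comparison as_int16("AB") < as_int16("BA") agrees only on a big-endian machine. On a little-endian machine the first character becomes the least significant byte, so (in ASCII) "AB" reads as 0x4241 and "BA" as 0x4142, and the order is reversed.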

Warning: this technique of writing to one union member and reading from another may look suspicious. To the best of my knowledge, it is unclear from the Standard whether this is undefined behavior or not. See this thread for the discussion. I used it in GCC, which documents it as supported behavior (see here).
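For readers who prefer to sidestep the union question, one commonly used alternative is to copy the characters into an integer with std::memcpy, which is well defined for trivially copyable types. This is a swapped-in technique, not the one used in this library; equal_codes is a hypothetical free function, and the sketch assumes N is between 1 and 8.

```cpp
#include <cstring>
#include <stdint.h>

// Hypothetical helper: compares two N-character arrays for equality
// by copying them into zero-initialized 64-bit integers.
template <int N>
bool equal_codes(const char (&l)[N], const char (&r)[N])
{
    // Compile-time check (C++03 style): the array must fit in 64 bits.
    typedef char n_must_be_between_1_and_8[(N >= 1 && N <= 8) ? 1 : -1];
    (void)sizeof(n_must_be_between_1_and_8);

    uint64_t li = 0;
    uint64_t ri = 0;
    std::memcpy(&li, l, N); // bytes beyond N stay zero in both integers
    std::memcpy(&ri, r, N);
    return li == ri;
}
```

Optimizing compilers usually turn such small fixed-size copies into plain loads, but whether this matches the speed of the union version is something to measure rather than assume.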

Practical considerations

By now, we have covered the most interesting topics: compact storage and efficient relational ops. But a real-life type needs to address other, more mundane expectations: construction, interoperability with other string-like interfaces (that expect C-arrays or std::strings), IO, and a couple of others.

How will codes obtain a value in a real application? Frameworks for parsing XML, GUI forms or communicating with DBs will most likely use either C-strings or std::string, so at the very minimum we must ensure conversions to and from these types.

In order to convert to a std::string, we can provide a member function:

    template <int N>
    class code
    {
      // ...
      std::string to_string() const
      {
        return std::string(array_, N);
      }
    };

We cannot provide any function like c_str(), because we do not (and do not want to) store the terminating null character. The users could use our interface like this:

c.to_string().c_str();

But this involves issues of the life-time of temporaries, and besides, we do not want to involve the creation of std::strings and risk the potential memory allocation, unless this is absolutely necessary.

Instead, we can offer a std::vector-like compatibility interface in the form of member functions data() and size():

    template <int N>
    class code
    {
      // ...
      const char* data() const { return array_; }
      unsigned size() const { return static_cast<unsigned>(N); }
    };

Conversion to string is easy, because the set of valid code values is a subset of all string values. The conversion from string to code is more difficult, as it may fail. There are strings (of size different from N) that are not representable as code. We have to figure out a way of handling the conversion failure.

I have considered three options:

1. Simply require the correct string size as precondition, and move the responsibility for checking it to whoever requests the conversion.
2. Make no precondition, check the size manually and throw an exception if the size is wrong. This way an attempt to pass the wrong string is treated as a condition that warrants (although not requires) the application shut-down.
3. Have a conversion function that returns a Boolean result indicating success or failure. The caller may not know if the string fits or not, and may use our conversion function to check it.

Out of the three options, I consider (2) the worst one. I know that many people will strongly object. I do not intend to convince anyone. I can only try to explain my reasoning. I believe that passing an incompatible size is either a bug or a legitimate use. If it is a bug, bugs should be fixed, not reported at run-time. If it is a legitimate use, why trigger the stack unwinding and a potential application shut-down?

Option (2) tries to be somewhere in between: “you shouldn’t pass the incompatible string, but well, you can, and we will handle it, but don’t do it, but it is in fact OK if you do…” I disagree with such a diffuse stance. The incompatible string is either a legitimate input or not.

Options (1) and (3) give opposite but clear answers to the question of what is a legitimate input. Either choice is good; one has to be made. I chose (3).

    template <int N>
    class code
    {
      // ...
      bool from_string(const char* str, size_t len)
      {
        if (BOOST_LIKELY(len == N)) {
          as_int_ = int_t(); // zero out the buffer
          std::memcpy(array_, str, len);
          return true;
        }
        else {
          return false;
        }
      }

      friend bool code_from_string(const std::string& s, code& c)
      {
        return c.from_string(s.c_str(), s.length());
      }

      friend bool code_from_string(const char* s, code& c)
      {
        return c.from_string(s, std::strlen(s));
      }
    };

So, the contract is this: if the string’s size matches, we copy its value and return true; otherwise we return false and leave the initial value of the code object unchanged. This is called the strong failure guarantee.

Macro BOOST_LIKELY is a fairly new addition to the Boost.Config library. It gives a hint to the compiler which branch it should optimize for.

But that interface causes another issue. In order to use it, we already need to have an object of type code created, with some value: but what value? This is a problem similar to the common pattern of reading a variable from the stream:

    int i; // what value?
    std::cin >> i;

Using default construction like this, with an indeterminate value, is often considered bad practice, and from time to time people have suggested that the language should offer a special syntax for expressing the intention here:

    int i [[uninitialized]]; // can be written to but not read
    std::cin >> i;

While changing the language is the responsibility of the C++ Standards Committee, we can do something similar for our type code.

To implement this we will define a tag class:

    class uninitialized_code_t {};
    const uninitialized_code_t uninitialized_code = uninitialized_code_t();

The purpose of such a tag class is to be a type distinct from any other type. We can now provide a constructor:

    template <int N>
    class code
    {
      // ...
      code(uninitialized_code_t) : as_int_() {}
    };

We fill the array with zeros, which should be a value distinct from any proper code, and which will compare unequal to and less than any decent code.

We can now use the conversion like this:

    code<3> c (uninitialized_code); // singular state
    if (code_from_string(s, c)) {
      // ok, you can read the value of c
    }
    else {
      // you must not read from c
    }

You may find it inconvenient that this design forces you to write at least two statements in order to convert a std::string to code. But it has one benefit: it is as efficient as possible and expressive enough for you to use it to define any interface of your liking. You can use it to implement a conversion function that returns a boost::optional<code<N>>, or one that throws on failure, or whatever you see fit.
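As an illustration of building another interface on top of the Boolean-returning conversion, here is a minimal sketch of a throwing wrapper. The class below is a simplified stand-in (plain array, no union, no BOOST_LIKELY), and code_from_string_or_throw is a hypothetical name, not part of the library.

```cpp
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

struct uninitialized_code_t {};

// Simplified stand-in for the class discussed in the text.
template <int N>
class code
{
    char array_[N];

public:
    explicit code(uninitialized_code_t) { std::memset(array_, 0, N); }

    // Returns false and leaves the object unchanged if len != N.
    bool from_string(const char* str, std::size_t len)
    {
        if (len != static_cast<std::size_t>(N))
            return false;
        std::memcpy(array_, str, N);
        return true;
    }

    std::string to_string() const { return std::string(array_, N); }
};

// Hypothetical wrapper: reports a size mismatch with an exception.
template <int N>
code<N> code_from_string_or_throw(const std::string& s)
{
    uninitialized_code_t tag;
    code<N> c(tag); // singular state, overwritten below on success
    if (!c.from_string(s.data(), s.size()))
        throw std::invalid_argument("string length does not match code length");
    return c;
}
```

With this sketch, code_from_string_or_throw<3>("WAW") returns a code whose to_string() is "WAW", while passing "WA" throws std::invalid_argument. An optional-returning variant would follow the same pattern, returning an empty optional instead of throwing.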

This poses a new question. Given that we have a way to construct a code with a singular value, should we not define a default constructor, and have it assign the singular value?

Personally, I am not at all in favor of creating by default (and silently) an object in a singular state. In the cases you need to use a singular state temporarily, having to explicitly type uninitialized_code somewhere is at least a clear indication that you are doing something risky locally. With a default constructor it is worse, because the singular state is seeded implicitly, and it is difficult to follow.

However, I must admit that I was beaten by real life here. This library was designed to replace the usage of std::string in the program I was maintaining. The reliance on the default constructor of std::string, and in general on every type having a default constructor (putting an object in a half-initialized state), was so huge that forcing the use of uninitialized_code everywhere, only to half-initialize codes inside other default constructors, was pointless. I gave up and added the default constructor.

But that just triggers another set of issues. Because it is easy to inadvertently use the default construction, we are going to have a lot of default-constructed objects floating around, and we need to make sure that all the operations from the interface of code are well defined for a default-constructed, and in fact meaningless, code. We also need to provide a way of checking if a given object is in a default-constructed state. Luckily, the half-initialized state can be represented by a regular combination of characters in the same array; so, to a great extent the default-created state is just a regular state. The only tricky part is how the null chars are treated when dealing with strings and IO functions.

With the default-construction-related functions:

    template <int N>
    class code
    {
      // ...
      code() : as_int_() {}
      bool is_initialized() const { return as_int_ != int_t(); }
    };

We need to change at least function to_string:

    template <int N>
    class code
    {
      // ...
      std::string to_string() const
      {
        if (BOOST_LIKELY(is_initialized()))
          return std::string(data(), size());
        else
          return std::string();
      }
    };

This change is not strictly necessary, but I want to make an additional guarantee: that an uninitialized code converted to string returns an empty string:

assert (code<N>().to_string().empty());

The special effort is required because of how std::string handles null characters. For details see this post.
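The behavior in question can be shown with a minimal sketch: constructing a std::string from a pointer and a length keeps embedded null characters, so a zero-filled buffer does not produce an empty string. The function name string_from_zeroed_buffer is made up for this illustration.

```cpp
#include <string>

// Builds a std::string from three zero characters, the way an unguarded
// to_string() would for a default-constructed 3-letter code.
inline std::string string_from_zeroed_buffer()
{
    const char zeros[3] = {0, 0, 0};
    return std::string(zeros, 3); // pointer + length: nulls are kept
}
```

The result has size 3 and is not empty, which is why to_string checks is_initialized() first and returns std::string() explicitly.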

To be continued…

I have to stop for today. There is more to be said about the interface and the implementation of class template code and, to be honest, I haven’t yet even mentioned the most interesting feature of the library: type safety. But these have to wait for another post.