Ben's Blog

Storing C++ Objects in Distributed Memory

This is part of a series of posts on BCL, the distributed data structure library I work on. I assume here that the reader is familiar with remote pointers, which are C++ objects that allow the user to read and write to memory that lives in processes on other nodes in a cluster.

This post focuses on the problem of storing objects in remote memory. Specifically, how can we create a container abstraction that can store complex types like std::string or std::vector<T> in distributed memory while also preserving efficiency for simple types like int . I start by discussing why we can't simply byte-copy all objects into distributed memory, then go on to create a container class that can wrap complicated objects so we can store them in distributed memory. I then explore why this container class isn't the best option for simple types and create a partial specialization that will allow us to transparently use the container interface for simple types while achieving good performance.

Introduction

One of the challenges of building distributed data structures is that they need to be able to store things of various types, and not just primitive types like int and float , but complex types like those in the C++ STL or that users might define for themselves.

For some types, we can store objects in distributed memory simply by byte-copying them there. Examples of this kind of type include int , other primitive types, and structs containing only primitive types.

// This struct is byte-copyable, or // `trivially copyable` in STL parlance. struct foo { int foo; float bar; char buf[23]; };

Let's look at an example where this is not the case. An essential example program for hash tables, which are perhaps one of the most used distributed data structures, involves storing strings as keys to integer values.

HashMap<std::string, int> map; std::string my_string = std::to_string(my_rank()); map.insert({my_string, my_rank()});

What happens when we try to store an STL string in distributed memory? Well, let's look at a representative std::string implementation. Since a string can have variable size, it must carry a pointer, and possibly a length inside it.

struct string { size_t len_; char* buf_; };

We can't byte-copy this string into remote memory because it contains a regular pointer to a process' virtual address space. If another process retrieves this string object and tries to use it, dereferencing the pointer buf_ will lead to memory errors.

Introducing Containers

We can deal with this problem by introducing a container class. Containers will store values in a serialized form so that the container itself can be byte-copied to another process, then unpacked and read.

To store anything of variable size in distributed memory, we'll need to use a remote pointer, which we can allocate (and reallocate) to fit whatever data size we encounter at runtime [1]. We can take a first stab at sketching out a container for storing variable size objects by separating our abstraction into two parts: (1) a serialization template struct, and (2) a container class. While the serialization struct will provide functions for serializing and deserializing the data to and from a byte-copyable form, the container class will actually store the data in remote memory.

Let's take a first pass at creating a serialization struct for strings. Since strings are variable size, our serialization method will need to return something variable size. The return value will also need to be something that the container object can understand and store in remote memory. Let's assume we have a standard serial_ptr<U> class that can store a string of values of type U . We can then create a template struct serialize<T> for serializing a value of type T and a specialization for std::string that provides a way to serialize and deserialize strings.

template <typename T> struct serialize; template <> struct serialize<std::string> { serial_ptr<char> serialize(const std::string& str) { serial_ptr<char> ptr(str.length()); for (size_t i = 0; i < str.length(); i++) { ptr[i] = str[i]; } return ptr; } std::string deserialize(const serial_ptr<char>& ptr) { return std::string(ptr.data(), ptr.size()); } };

The implementation details are not particularly important here (we're just reading from and writing to a string), but the import thing is that this provides an interface that we can then use to create a container class for storing strings in distributed memory. That is, the methods serialize<std::string>::serialize(const& std::string) and serialize<std::string>::deserialize(const& serial_ptr<char>) .

With this serialization interface in hand, we can proceed to write a container class that will store our variable-size object in distributed memory.

template <typename T, typename Serialize = serialize<T>> class container { public: // When we call Serialize::serialize(T), // the type we get back // (of the form serial_ptr<U>) using serial_ptr_type = decltype( std::declval<Serialize>().serialize( std::declval<T>() ) ); // From serial_ptr<U>, the type U. using serialized_type = typename serial_ptr_type::type; T get() const; void set(const T& value); private: remote_ptr<serialized_type> ptr_ = nullptr; size_t len_; };

So, for a serialization method that serializes something of type T to type serial_ptr<U> ( U is referred to by the type alias serialized_type above), we'll store in remote memory an array of U s. We can now define methods to read and write to the container. The important detail here is that in order to get a value from a container, we have to perform an rget operation to read a variable-size object from a remote pointer, and in order to set a new value in a container, we have to perform an rput to write a variable-size object.

template <typename T, typename Serialize> T container<T, Serialize>::get() { if (ptr_ != nullptr) { serial_ptr<serialized_type> serialized(len_); rget(ptr_, serialized.data(), len_); return Serialize{}.deserialize(serialized); } else { return T(); } } template <typename T, typename Serialize> void container<T, Serialize>::set(const T& value) { if (ptr_ != nullptr) { delete_(ptr_); } auto serialized = Serialize{}.serialize(value); ptr_ = alloc<serialized_type>(serialized.size()); rput(ptr_, serialized, serialized.size()); len_ = serialized.size(); }

Now that we have a container class, we can easily create remote arrays that contain complex objects and write programs that manipulate them.

// Create a remotely accessible array on rank 0 remote_ptr<container<std::string>> ptr = nullptr; if (rank() == 0) { ptr = alloc<container<std::string>>(nprocs()); } ptr = broadcast(ptr, 0); // Create a container, store a string inside it. // This involves writing the variable-sized string // to remote memory. container<std::string> val; val.set(std::to_string(my_rank())); // Store the container inside the array. ptr[my_rank()] = val; // Print out all the strings. if (rank() == 0) { for (size_t i = 0; i < nprocs(); i++) { container<std::string> val; val = ptr[i]; std::cout << val.get() << std::endl; } }

Cost

So now that we have a container class that can store any C++ object that we can write a serialization struct for, what is the cost of the abstraction that we've created? Intuitively, we know that using this container class to store or retrieve an object involves two remote communication operations, one to fetch the container object, and a second to fetch the serialized object itself from the remote pointer inside the container class. So, if we're storing an int , that's two rget s, one to fetch a pointer, and one to fetch the actual value. Seems inefficent.

More formally, let's assume the common alpha-beta communication cost model, where the cost of reading or writing is \(\alpha + \beta n\). \(\alpha\) is the latency cost of performing a read or write, \(\beta\) is the bandwidth cost of sending a byte \((1 / \text{bandwidth})\), and \(n\) is the number of bytes to send.

With the container, we first have to retrieve the container object itself, which has a bandwidth cost of $$ \alpha + \beta \epsilon, $$ where \(\epsilon\) is the size of a container, which for most architectures is 32 bytes (due to alignment). Then, we have the cost in the get or set operation of actually pulling the data from the pointer inside the container object. We can call this $$ \alpha + \beta n, $$ where \(n\) is the number of bytes in the serialized object. This makes the total cost here $$ 2\alpha + \beta (n + \epsilon) $$

For large objects, this is likely to be efficient, since the value of \(n\) will be quite large, meaning that the cost of accessing the object is fundamentally bandwidth bound. However, for small objects, the value of \(n\) will be small, and the cost of the roundtrip latency necessary to access the element, \(2\alpha\), will dominate. This means that in the worst case, we've introduced a 2x overhead with our container abstraction.

Optimizing for Small, Fixed-Size Objects

In the case of small, fixed-size objects, adding twice the latency is a big cost, so we'd like to write an improved container implementation for these types. Thankfully, C++ allows us to transparently pick a different container implementation for these types by writing a partial specialization for them.

To start with, how would we like to write a serializer for a simple type, like, say int ? The serializer is still important here because it needs to signify to the container that we can store this type in distributed memory. Since no serialization is actually necessary for int , the most straightforward serializer would just return the original value to serialize it and return the serialized value to deserialize it.

template <> struct serialize<int> { int serialize(const int& value) { return value; } int deserialize(const int& value) { return value; } };

Note that this serializer doesn't return a serial_ptr<U> object, which was the special class we serialized variable-size objects to. Since we're now just storing a regular object of type T in distributed memory, which the user is guaranteeing to us by contract is byte-copyable, we can store the value itself inside the container instead of holding a pointer to the serialized value.

// Enable only if the return type of serialization is // *not* a serial_ptr template <typename T, typename Serialize = serialize<T>, typename = std::enable_if_t< !is_serial_ptr< decltype(Serialize{}.serialize(T()))>::value > > class container { public: // Return type of serialization using serialized_type = decltype( std::declval<Serialize>().serialize( std::declval<T>() ) ); T get() const; void set(const T& value); private: serialized_type value_; };

Now we're just storing a serialized_type value directly inside the container, instead of using a remote pointer. Our get and set methods can directly use the field inside our container now.

template <typename T, typename Serialize> T container<T, Serialize>::get() { return Serialize{}.deserialize(value_); } template <typename T, typename Serialize> void container<T, Serialize>::set(const T& value) { value_ = Serialize{}.serialize(value); }

Performance

Complex Types

To get an idea of how this code actually performs in the wild, let's look at the assembly we get when we compile an example. Let's start with the simple string example from above. Looking at the generated assembly, we can see that reading an std::string object from remote memory is a two-stage process: first, we must read the container, then, we must read in the data pointed to by the container. So there are two calls to remote_read (lines 319 and 333 of the assembly), which is expensive if the objects are small and bound by latency. This is, however, unavoidable for variable-size objects.

Also notable is the fact that we're making an extra copy of the local string object: first copying it into a serial_ptr object, then copying from that object to an std::string . If std::string provided a special constructor that allowed us to move a char pointer to the string object instead of copying the data, we might be able to avoid this with a carefully crafted serialize struct, but without cooperation from the original object, it will not be possible to elide this local copy operation.

Simple Types

Next, let's look at the performance for the same code example with a simple type, like int . Looking at the generated assembly here, we see that there is only one remote_read operation taking place inside the loop (line 51 of the assembly). Our partial specialization for simple types is evidently being used.

As an added bonus, the compiler is intelligently able to elide making an extra local copy of the int object, since the deserialization function simply returns the original int, instead of reading and writing to a buffer in a complicated loop. The assembled code is almost exactly the same as we would get by reading and writing to raw integers instead of using containers, and it performs the same.

Expanding to all Trivially Copyable Types

There are a lot of simple types, like int , float , double , uint64_t , etc., that are byte-copyable, and it would be nice to support all of them without having to write a serialization struct for each one. It turns out that the C++ standard library has a formal definition for when types, including user-defined types, are "trivially copyable," and we can use this to create one template struct that will provide a serializer for all trivially copyable types.

template <typename T> struct serialize <T, std::enable_if_t< std::is_trivially_copyable<T>::value >> { T serialize(const T& value) { return value; } T deserialize(const T& value) { return value; } };

This allows users to even define their own trivially-copyable types and automatically use them with our container class.

Conclusions and Caveats

So we've created a container abstraction that we can use to store any C++ object in distributed memory, given that it is trivially copyable or we can define a serialization struct for it. This kind of mechanism is crucial if we are to develop distributed data structure types, like lists and hash tables, that can store arbitrarily complex C++ objects. There are a few caveats that come up when using this container abstraction that are worth mentioning.

Local Copy Elision

GCC and Clang are smart enough to avoid creating a local copy of a container when the serialization function just involves returning the object stored inside the container. However, when performing block copies, the loop trips up the compiler, and we end up creating an array of containers, then an array of values extracted from the container that is byte-identical to the containers. I solve this problem in practice by writing special block copy functions for copying multiple containers stored using a remote pointer. These copy functions can use partial specialization to take a fast path when the container is using an identity serialization method.

In-Place Operations on Containers

Some data types support in-place operations in distributed memory, for example integer types, which support atomic compare-and-swap, fetch-and-add, etc. operations. A simple solution that allows users to take advantage of these in-place operations is to have a decay_container method that will decay a remote pointer to a container to a remote pointer to a value when the container identity serializes the type it's storing, as is the case with integer types. A more robust method would be to do this through reference types, which could be specialized based on the serialization function, but this is something I'm still exploring.

Footnotes

[1] Remote pointers are C++ objects that can reference memory on processes that might live on other nodes in a cluster. They can be used almost identically to regular C++ pointers. I've written a blog post on how to design remote pointer objects for C++.