Ben's Blog

Creating Expressive C++ Smart Pointers for Remote Memory

In my work as a graduate student, I build distributed data structures, and having a remote pointer abstraction is an essential tool for me to write clean, correct code. In this blog post, I'll motivate why remote pointers are necessary, explain how I built C++ remote pointer objects in my library, explain how to make them work like regular C++ pointers using remote reference objects, and discuss when the abstraction breaks down because a custom pointer simply can't do a regular pointer's job (yet). Hopefully this will be useful for other people interested in developing high-level programming abstractions.

Low-Level APIs

When working on distributed computers or with network hardware, you often have access to read and write to some memory space using a C API. One example of this is MPI's API for one-sided communication. In that API, you use functions to read and write directly from the memory of other nodes in a distributed cluster. Here's a slightly simplified version of what this looks like.

void remote_read(void* dst, int target_node, int offset, int size); void remote_write(void* src, int target_node, int offset, int size);

Given an offset into a target node's shared memory segment, remote_read will read some bytes from it, and remote_write will write some bytes to it.

These APIs are great because they give us access to important primitives we can use to implement programs that run on clusters of computers. They're also great because these are really fast primitives that directly mirror what's offered by the hardware: remote direct memory access (RDMA). Modern supercomputer networks like Cray Aries and Mellanox EDR offer 1-2 μs read/write latencies, which are made possible by allowing the network interface card (NIC) to directly read and write to RAM so as to avoid waiting for the remote CPU to wake up and answer your network request.

Where these APIs are not so great is from an application programming perspective. Even with the simple APIs written above, it's easy to accidentally overwrite data you didn't mean to, since you don't have separate names for separate objects stored in memory, just one big contiguous buffer. Also, the interface is untyped, so you don't have the esteemed benefit of having a compiler yell at you if you write the wrong type of value to the wrong location; your code will simply be wrong, in mysterious yet catastrophic ways. Complicating things more, in reality, these APIs are a bit more complicated, and it's easy to swap two or more parameters by mistake.

Remote Pointers

Pointers are an important and necessary abstraction layer for building high-level programming tools. While pointers can be complicated to use directly and can introduce lots of bugs, they are fundamental building blocks, and data structures and even C++ references will often use pointers under the hood.

If we assume we have an API like the one listed above, then a unique location in remote memory is specified by two things: (1) a rank, or process ID, and (2) an offset into that rank's shared segment of remotely available memory. We can go ahead and make this a struct.

template <typename T> struct remote_ptr { size_t rank_; size_t offset_; };

template <typename T> T rget(const remote_ptr<T> src) { T rv; remote_read(&rv, src.rank_, src.offset_, sizeof(T)); return rv; } template <typename T> void rput(remote_ptr<T> dst, const T& src) { remote_write(&src, dst.rank_, dst.offset_, sizeof(T)); }

remote_ptr<int> ptr = ...; int rval = rget(ptr); rval++; rput(ptr, rval);

Pointer Arithmetic

Pointer arithmetic is an essential feature that allows programmers to manage collections of values in memory—and if we're writing a distributed memory program, presumably we often want to work with large collections of values.

What does incrementing or decrementing a remote pointer mean? Well, the simplest option is that remote pointer arithmetic is analogous to regular pointer arithmetic: p+1 will just point to the next sizeof(T) -aligned memory location after p on the original rank's shared segment.

While this is not the only possible definition of remote pointer arithmetic, it has come to be most accepted recently, with remote pointers in libraries like UPC++, DASH, and BCL using it. However, Unified Parallel C (UPC), which has a long legacy in the high-performance computing (HPC) community, has a more elaborate definition of pointer arithmetic [1].

Implementing pointer arithmetic this way is straightforward and only involves modifying the pointer offset.

template <typename T> remote_ptr<T> remote_ptr<T>::operator+(std::ptrdiff_t diff) { size_t new_offset = offset_ + sizeof(T)*diff; return remote_ptr<T>{rank_, new_offset}; }

This opens up the possibility of accessing arrays of data in distributed memory. We could, for example, have each process in a SPMD program write or read from a different variable in an array pointed to by a remote pointer [2].

void write_array(remote_ptr<int> ptr, size_t len) { if (my_rank() < len) { rput(ptr + my_rank(), my_rank()); } }

Implementing other operators necessary to support the full set of normal pointer arithmetic operations is straightforward.

Picking a nullptr Value

For normal pointers, the nullptr value is NULL , which is usually just a #define to 0x0, since that memory location is unlikely to be used. In our remote pointer scheme, we have the option of either picking a specific pointer value to be nullptr , thus making that memory location unusable, or else including a special boolean member to mark whether a pointer is null. While making a memory location unusable is not optimal, adding a single boolean value will, in most compilers, double the size of the remote pointer object from 128 bits to 256 bits in order to maintain alignment. This is particularly undesirable. In my library, I chose {0, 0} , that is, offset 0 in rank 0, as the nullptr value.

There are, perhaps, other choices of nullptr that would work equally well. In addition, some programming environments, such as UPC, have implemented narrow pointers that fit in only 64 bits, allowing them to be used in atomic compare-and-swap operations. A narrow pointer comes with the tradeoff that either the offset or the rank identifier must fit within 32 bits or less, limiting scalability.

Remote References

In languages like Python, the bracket operator serves as syntactic sugar to call the methods __setitem__ and __getitem__ , depending on whether you are reading or writing to the object. In C++, operator[] does not distinguish what kind of value category the object is or whether the returned value will ultimately be read from or written to. To resolve this, C++ data structures return references, which refer to memory held inside the container and can be either written to or read from. An implementation of operator[] for std::vector might look something like this.

T& operator[](size_t idx) { return data_[idx]; }

Of the essence here is the fact that we are returning something of type T& , which is a raw C++ reference that can be written to, rather than something of type T , which only represents the value of the original data.

We cannot return a raw C++ reference in our case because we are referencing memory held on another node which is not present in our own virtual address space. We can, however, create our own custom reference objects.

A reference is essentially an object that serves as a wrapper around a pointer, and it has two essential functions: it can be implicitly converted to a value of type T , and it can be assigned to a value of type T . So, for a remote reference, we just have to implement an implicit conversion operator, which reads the value, and an assignment operator, which writes to the value.

template <typename T> struct remote_ref { remote_ptr<T> ptr_; operator T() const { return rget(ptr_); } remote_ref& operator=(const T& value) { rput(ptr_, value); return *this; } };

This allows us to add some new, powerful features to our remote pointer that allow it to be dereferenced in the same way as regular pointers.

template <typename T> remote_ref<T> remote_ptr<T>::operator*() { return remote_ref<T>{*this}; } template <typename T> remote_ref<T> remote_ptr<T>::operator[](ptrdiff_t idx) { return remote_ref<T>{*this + idx}; }

We've now filled in all the pieces to use our remote pointers as regular pointers. We can re-write the simple example program from above.

void write_array(remote_ptr<int> ptr, size_t len) { if (my_rank() < len) { ptr[my_rank()] = my_rank(); } }

We can, of course, write more complicated programs using our new pointer API, such as this function to do a parallel tree reduction [3]. Implementations using our remote pointer class are safer and cleaner than what would typically be written using the C API described above.

Runtime Cost (Or Lack Thereof!)

But what is the cost of using a high-level abstraction like this? With each memory access, we are calling the dereference method, returning an intermediate object that wraps the pointer, then calling the conversion operator or assignment operator on the intermediate object. Will we pay a runtime cost for this?

It turns out, if we design our pointer and reference classes carefully, we will pay no runtime cost for this abstraction—modern C++ compilers are able to compile these intermediate objects and method calls away through aggressive inlining. To judge what the cost of our abstraction is, we can compile a simple example program and then examine the assembly to see what objects and methods exist at runtime. For the tree reduction example, compiled with the remote pointer and reference classes we've described here, modern compilers render the tree reduction as a few calls to remote_read and remote_write [4]. No class methods are called and no reference objects exist at runtime.

Interoperating with Data Structure Libraries

Experienced C++ programmers will recall that the C++ standard template library specifies that STL containers should support custom C++ allocators. Allocators provide a way to allocate memory, and this memory can be referenced using custom pointer types. Does this mean that we can simply create a "remote allocator" and plug it in to store data in remote memory using STL containers?

Unforunately, no. Presumably for performance reasons, the C++ standard no longer requires support for custom reference types, and most implementations of the C++ standard library do not support them. If you want to use GCC's libstdc++, for example, you may use custom pointers, but are restricted to regular C++ references, which prevents you from using STL containers in remote memory. Some high-level C++ template libraries like Agency, which uses custom pointer and reference types, have their own implementations of some STL data structures that do support remote reference types. This allows users a lot of free license to build creative allocator, pointer, and reference types and then have a collection of data structures you can automatically use with them.

Broader Issues

This blog post has touched on a lot of broader issues that remain unresolved.

Memory allocation. Now that we can reference objects in remote memory, how do we reserve or allocate remote memory to begin with?

Now that we can reference objects in remote memory, how do we reserve or allocate remote memory to begin with? Support for objects. What about storing types more complicated than int in remote memory? Can we cleanly support complex types? Can we support simple types at the same time without paying a serialization cost?

What about storing types more complicated than in remote memory? Can we cleanly support complex types? Can we support simple types at the same time without paying a serialization cost? Distributed data structure design. Now that we have these abstractions, what kind of data structures and applications can we build with them? What abstractions should we use to distribute data?

I plan on writing a series of blog posts that deal with these and other issues that have arisen while developing my distributed data structures library. Please stay tuned!

Footnotes

[1] In UPC, pointers have a phase that determines which rank a pointer will point to when incremented. Phase allows pointers to encapsulate distributed arrays with a variety of distribution patterns. This is very powerful, but can be mystifying to novice users. Although some expert UPC programmers still prefer this technique, a more object-oriented approach that's catching on is to first build a simple remote pointer class, then rely on separate distributed data structures to control data distribution.

[2] Most applications in HPC are written using a SPMD, or "single program, multiple data" style. SPMD APIs offer a my_rank() function or variable, which tells a process running the program its unique rank or process ID and can be used to branch off from the main program.

[3] Here is a basic tree reduction written in a SPMD style using our remote pointer class. Code is adapted from code originally written by my colleague Andrew Belt

template <typename T> T parallel_sum(remote_ptr<T> a, size_t len) { size_t k = len; do { k = (k + 1) / 2; if (my_rank() < k && my_rank() + k < len) { a[my_rank()] += a[my_rank() + k]; } len = k; barrier(); } while (k > 1); return a[0]; }

[4] You can see the compiled result of the above code here: https://godbolt.org/z/yDKluF.