A look at The Machine

Benefits for LWN subscribers The primary benefit from subscribing to LWN is helping to keep us publishing, but, beyond that, subscribers get immediate access to all site content and access to a number of extra site features. Please sign up today!

In what was perhaps one of the shortest keynotes on record (ten minutes), Keith Packard outlined the hardware architecture of "The Machine"—HP's ambitious new computing system. That keynote took place at LinuxCon North America in Seattle and was thankfully followed by an hour-long technical talk by Packard the following day (August 18), which looked at both the hardware and software for The Machine. It is, in many ways, a complete rethinking of the future of computers and computing, but there is a fairly long way to go between here and there.

The hardware

The basic idea of the hardware is straightforward. Many of the "usual computation units" (i.e. CPUs or systems on chip—SoCs) are connected to a "massive memory pool" using photonics for fast interconnects. That leads to something of an equation, he said: electrons (in CPUs) + photons (for communication) + ions (for memory storage) = computing. Today's computers transfer a lot of data and do so over "tiny little pipes". The Machine, instead, can address all of its "amazingly huge pile of memory" from each of its many compute elements. One of the underlying principles is to stop moving memory around to use it in computations—simply have it all available to any computer that needs it.

Some of the ideas for The Machine came from HP's DragonHawk systems, which were traditional symmetric multiprocessing systems, but packed a "lot of compute in a small space". DragonHawk systems would have 12TB of RAM in an 18U enclosure, while the nodes being built for The Machine will have 32TB of memory in 5U. It is, he said, a lot of memory and it will scale out linearly. All of the nodes will be connected at the memory level so that "every single processor can do a load or store instruction to access memory on any system".

Nodes in this giant cluster do not have to be homogeneous, as long as they are all hooked to the same memory interconnect. The first nodes that HP is building will be homogeneous, just for pragmatic reasons. There are two circuit boards on each node, one for storage and one for the computer. Connecting the two will be the "next generation memory interconnect" (NGMI), which will also connect both parts of the node to the rest of the system using photonics.

The compute part of the node will have a 64-bit ARM SoC with 256GB of purely local RAM along with a field-programmable gate array (FPGA) to implement the NGMI protocol. The storage part will have four banks of memory (each with 1TB), each with its own NGMI FPGA. A given SoC can access memory elsewhere without involving the SoC on the node where the memory resides—the NGMI bridge FPGAs will talk to their counterpart on the other node via the photonic interface. Those FPGAs will eventually be replaced by application-specific integrated circuits (ASICs) once the bugs are worked out.

ARM was chosen because it was easy to get those vendors to talk with the project, Packard said. There is no "religion" about the instruction set architecture (ISA), so others may be used down the road.

Eight of these nodes can be collected up into a 5U enclosure, which gives eight processors and 32TB of memory. Ten of those enclosures can then be placed into a rack (80 processors, 320TB) and multiple racks can all be connected on the same "fabric" to allow addressing up to 32 zettabytes (ZB) from each processor in the system.

The storage and compute portions of each node are powered separately. The compute piece has two 25Gb network interfaces that are capable of remote DMA. The storage piece will eventually use some kind of non-volatile/persistent storage (perhaps even the fabled memristor), but is using regular DRAM today, since it is available and can be used to prove the other parts of the design before switching.

SoCs in the system may be running more than one operating system (OS) and for more than one tenant, so there are some hardware protection mechanisms built into the system. In addition, the memory-controller FPGAs will encrypt the data at rest so that pulling a board will not give access to the contents of the memory even if it is cooled (à la cold boot) or when some kind of persistent storage is used.

At one time, someone said that 640KB of memory should be enough, Packard said, but now he is wrestling with the limits of the 48-bit addresses used by the 64-bit ARM and Intel CPUs. That only allows addressing up to 256TB, so memory will be accessed in 8GB "books" (or, sometimes, 64KB "booklettes"). Beyond the SoC, the NGMI bridge FPGA (which is also called the "Z bridge") deals with two different kinds of addresses: 53-bit logical Z addresses and 75-bit Z addresses. Those allow addressing 8PB and 32ZB respectively.

The logical Z addresses are used by the NGMI firewall to determine the access rights to that memory for the local node. Those access controls are managed outside of whatever OS is running on the SoC. So the mapping of memory is handled by the OS, while the access controls for the memory are part of the management of The Machine system as a whole.

NGMI is not intended to be a proprietary fabric protocol, Packard said, and the project is trying to see if others are interested. A memory transaction on the fabric looks much like a cache access. The Z address is presented and 64 bytes are transferred.

The software

Packard's group is working on GPL operating systems for the system, but others can certainly be supported. If some "proprietary Washington company" wanted to port its OS to The Machine, it certainly could. Meanwhile, though, other groups are working on other free systems, but his group is made up of "GPL bigots" that are working on Linux for the system. There will not be a single OS (or even distribution or kernel) running on a given instance of the The Machine—it is intended to support multiple different environments.

Probably the biggest hurdle for the software is that there is no cache coherence within the enormous memory pool. Each SoC has its own local memory (256GB) that is cache coherent, but accesses to the "fabric-attached memory" (FAM) between two processors are completely uncoordinated by hardware. That has implications for applications and the OS that are using that memory, but OS data structures should be restricted to the local, cache-coherent memory as much as possible.

For the FAM, there is a two-level allocation scheme that is arbitrated by a "librarian". It allocates books (8GB) and collects them into "shelves". The hardware protections provided by the NGMI firewall are done on book boundaries. A shelf could be a collection of books that are scattered all over the FAM in a single load-store domain (LSD—not Packard's favorite acronym, he noted), which is defined by the firewall access rules. That shelf could then be handed to the OS to be used for a filesystem, for example. That might be ext4, some other Linux filesystem, or the new library filesystem (LFS) that the project is working on.

Talking to the memory in a shelf uses the POSIX API. A process does an open() on a shelf and then uses mmap() to map the memory into the process. Underneath, it uses the direct access (DAX) support to access the memory. For the first revision, LFS will not support sparse files. Also, locking will not be global throughout an LSD, but will be local to an OS running on a node.

For management of the FAM, each rack will have a "top of rack" management server, which is where the librarian will run. That is a fairly simple piece of code that just does bookkeeping and keeps track of the allocations in a SQLite database. The SoCs are the only parts of the system that can talk to the firewall controller, so other components communicate with a firewall proxy that runs in user space, which relays queries and updates. There are a "whole bunch of potential adventures" in getting the memory firewall pieces all working correctly, Packard said.

The lack of cache coherence makes atomic operations on the FAM problematic, as traditional atomics rely on that feature. So the project has added some hardware to the bridges to do atomic operations at that level. There is a fam_atomic library to access the operations (fetch and add, swap, compare and store, and read), which means that each operation is done at the cost of a system call. Once again, this is just the first implementation; other mechanisms may be added later. One important caveat is that the FAM atomic operations do not interact with the SoC cache, so applications will need to flush those caches as needed to ensure consistency.

Physical addresses at the SoC level can change, so there needs to be support for remapping those addresses. But the SoC caches and DAX both assume static physical mappings. A subset of the physical address space will be used as an aperture into the full address space of the system and books can be mapped into that aperture.

Flushing the SoC cache line by line would "take forever", so a way to flush the entire cache when the physical address mappings change has been added. In order to do that, two new functions have been added to the Intel persistent memory library ( libpmem ): one to check for the presence of non-coherent persistent memory ( pmem_needs_invalidate() ) and another to invalidate the CPU cache ( pmem_invalidate() ).

In a system of this size, with the huge amounts of memory involved, there needs to be well-defined support for memory errors, Packard said. Read is easy—errors are simply signaled synchronously—but writes are trickier because the actual write is asynchronous. Applications need to know about the errors, though, so SIGBUS is used to signal an error. The pmem_drain() call will act as a barrier, such that errors in writes before that call will signal at or before the call. Any errors after the barrier will be signaled post-barrier.

There are various areas where the team is working on free software, he said, including persistent memory and DAX. There is also ongoing work on concurrent/distributed filesystems and non-coherent cache management. Finally, reliability, availability, and serviceability (RAS) are quite important to the project, so free software work is proceeding in that area as well.

Even with two separate sessions, it was a bit of a whirlwind tour of The Machine. As he noted, it is an environment that is far removed from the desktop world Packard had previously worked in. By the sound, there are plenty of challenges to overcome before The Machine becomes a working computing device—it will be an interesting process to watch.

[I would like to thank the Linux Foundation for travel assistance to Seattle for LinuxCon North America.]

