In this special guest feature, Glenn Lockwood from NERSC shares his impressions of ISC 2019 from an I/O perspective.

I was fortunate enough to attend the ISC HPC conference this year, and it was a delightful experience from which I learned quite a lot. For the benefit of anyone interested in what they have missed, I took the opportunity on the eleven-hour flight from Frankfurt to compile my notes and thoughts over the week.

I spent most of my time in and around the sessions, BOFs, and expo focusing on topics related to I/O and storage architecture, so that comprises the bulk of what I’ll talk about below. Rather than detail the conference chronologically as I did for SC’18 though, I’ll only mention a few cross-cutting observations and trends here.

I’ll also not detail the magnificent HPC I/O in the Data Center workshop here, but anyone reading this who cares about storage or I/O should definitely flip through the slides on the HPC-IODC workshop website! This year HPC-IODC and WOPSSS merged their programs, resulting in a healthy mix of papers (in both CS research and applied research), expert talks, and fruitful discussion.

High-level observations

As is often the case for ISC, there were a few big unveilings early in the week. Perhaps the largest was the disclosure of several key architectural details surrounding the Aurora exascale system to be deployed at Argonne in 2021. TACC’s Frontera system, a gigantic Dell cluster stuffed with Intel Cascade Lake Xeons, made its debut on the Top500 list as well. In this sense, Intel was in good form this year. And Intel has to be, since only one of the handful of publicly disclosed pre-exascale (Perlmutter and Fugaku) and exascale systems (Frontier) will be using Intel parts.

The conference also had an anticipatory undertone as these pre-exascale and exascale systems begin coming into focus. The promise of ARM as a viable HPC processor technology is becoming increasingly credible as Sandia’s Astra machine, an all-ARM cluster integrated by HPE, appeared throughout the ISC program. These results are paving the way for Fugaku (the “post-K” machine), which will prove ARM and its SVE instruction set at extreme scale.

Also contributing to the anticipatory undertone was a lot of whispering that occurred outside of the formal program. The recently announced acquisition of Cray by HPE was the subject of a lot of discussion and conjecture, but it was clear that the dust was far from settled and nobody purported to have a clear understanding of how this would change the HPC market. There was also some whispering about a new monster Chinese system that was on the cusp of making this year’s ISC Top500. Curiously, the Wuxi supercomputer center (where Sunway TaihuLight is housed) had a booth on the show floor, but it was completely vacant.

Also noticeably absent from the show floor was NVIDIA, although they certainly sent engineers to participate in the program. By comparison, AMD was definitely present, although they were largely promoting the impending launch of Rome rather than their GPU lineup. A number of HPC solutions providers were excited about Rome because of both high customer demand and promising early performance results, and there wasn’t a single storage integrator with whom I spoke that wasn’t interested in what doors will open with an x86 processor and a PCIe Gen4 host interface.

Intel disclosures about Aurora 2021

Perhaps the biggest news of the week was a “special event” presentation given by Intel’s Rajeeb Hazra which disclosed a number of significant architectural details around the Aurora exascale system being deployed at Argonne National Laboratory in 2021.

Nodes will be comprised of Intel Xeon CPUs and multiple Intel GPUs

Intel has confirmed that Aurora will be built on Intel-designed general-purpose GPUs based on the “Xe” architecture with multiple GPUs per node. With this disclosure and the knowledge that nodes will be connected with Cray’s Slingshot interconnect, it is now possible to envision what a node might look like. Furthermore, combining the disclosure of a high GPU:CPU ratio, the Aurora power budget, and some vague guessing at the throughput of a 2021 GPU narrows down the number of nodes that we may expect to see in Aurora.

Although no specific features of the Intel GPUs were disclosed, Intel was also promoting their new AVX512-VNNI instructions to position their latest top-bin Xeon cores as the best option for inference workloads. Coupled with what we can assume will be highly capable GPUs for training acceleration, Intel is building a compelling story around their end-to-end AI portfolio. Interestingly, news that NVIDIA is partnering with ARM dropped this past week, but NVIDIA’s noted absence from ISC prevented a comparable ARM-NVIDIA AI solution from shining through.

System will have over 10 PB of system memory

Aurora will have a significant amount of memory, presumably comprised of a combination of HBM, DDR, and/or Optane persistent memory. The memory capacity is markedly higher than that of the AMD-based Frontier system, suggesting that Intel may be leveraging Optane persistent memory (which has a lower cost per bit than DDR) to supplement the HBM that is required to feed such a GPU-heavy architecture.

The storage subsystem will deliver over 230 PB of capacity at over 25 TB/sec

Perhaps the most interesting part of Aurora is its I/O subsystem, which will use an object store and an all-solid-state storage architecture instead of the traditional parallel file system. This will amount to 230 PB of usable flash capacity that can operate in excess of 25 TB/sec. Although I’ll describe this storage architecture in more depth below, combining the performance point of 25 TB/sec with the aforementioned high GPU:CPU ratio suggests that each compute node will be able to inject a considerable amount of I/O traffic into the fabric. This points to very capable Xeon cores and very capable NICs.
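To put those headline numbers in perspective, here is a back-of-envelope sketch in Python. The capacity, bandwidth, and memory figures are the "over" lower bounds quoted above, and decimal units (1 PB = 1,000 TB) are assumed. The node counts are purely hypothetical, since no node count was disclosed; they are only there to show how the per-node share scales.

```python
# Back-of-envelope arithmetic on the disclosed Aurora figures.
# Disclosed lower bounds (decimal units assumed: 1 PB = 1000 TB).
CAPACITY_PB = 230      # usable flash capacity
BANDWIDTH_TBPS = 25    # aggregate storage bandwidth in TB/sec
MEMORY_PB = 10         # aggregate system memory


def seconds_to_move(petabytes, tb_per_sec):
    """Time in seconds to stream a volume at a given aggregate rate."""
    return petabytes * 1000 / tb_per_sec


def per_node_gb_per_sec(tb_per_sec, node_count):
    """Average per-node share of the aggregate bandwidth, in GB/sec."""
    return tb_per_sec * 1000 / node_count


memory_dump_s = seconds_to_move(MEMORY_PB, BANDWIDTH_TBPS)
fill_hours = seconds_to_move(CAPACITY_PB, BANDWIDTH_TBPS) / 3600

print(f"Write out all of memory: {memory_dump_s:.0f} s")
print(f"Fill the whole storage system: {fill_hours:.1f} h")

# HYPOTHETICAL node counts; the real number was not disclosed.
for nodes in (5_000, 10_000, 20_000):
    share = per_node_gb_per_sec(BANDWIDTH_TBPS, nodes)
    print(f"{nodes} nodes -> {share:.2f} GB/sec per node")
```

At the quoted lower bounds, the full 10 PB of memory could be written out in under seven minutes, and the fewer nodes there are, the more I/O each one has to be able to push into the fabric.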

The programming model for the system will utilize SYCL

Intel has announced that its “One API” relies on the Khronos Group’s SYCL standard for heterogeneous programming in C++ rather than the incumbent choices of OpenMP, OpenACC, or OpenCL. This does not mean that OpenMP, OpenACC, and/or OpenCL won’t be supported, but it does reveal where Intel intends to put all of its efforts in enabling its own GPUs and FPGAs for HPC. They further emphasized their desire to keep these efforts open, standards-based, and portable, undoubtedly demonstrating a stark contrast with the incumbent GPU vendors. This is an interesting long-term differentiator, but time will tell whether SYCL is able to succeed where OpenCL has failed and gain a foothold in the HPC ecosystem.

DAOS will be HPC’s gateway drug to object stores

DAOS (the “Distributed Asynchronous Object Store,” pronounced like it’s spelled) is an object store that Intel has been developing for the better part of a decade in collaboration with the US Department of Energy. The DAOS name has become overloaded in recent years as a result of it changing scope, focus, and chief architects, and the current version is quite different from the original DAOS that was prototyped as a part of the DOE Fast Forward program (e.g., only one of three original DAOS components, DAOS-M, survives). A few key features remain the same, though:

It remains an object store at its core, but various middleware layers will be provided to expose alternate access APIs and semantics

It is specifically designed to leverage Intel Optane persistent memory and NAND-based flash to deliver extremely high IOPS in addition to high streaming bandwidth

It relies on user-space I/O via Mercury and SPDK to enable its extreme I/O rates

Its storage architecture is still based on a hierarchy of servers, pools, containers, and objects

Object stores have historically not found success in HPC due to HPC apps’ general dependence on POSIX-based file access for I/O, but the Aurora DAOS architecture cleverly bridges this gap. I was lucky enough to run into Johann Lombardi, the DAOS chief architect, at the Intel booth, and he was kind enough to walk me through a lot of the details.

DAOS will provide seamless integration with a POSIX namespace by using Lustre’s new foreign layout feature which allows an entity in the Lustre namespace to be backed by something that is not managed by Lustre. In practice, a user will be able to navigate a traditional file namespace that looks like any old Lustre file system using the same old ls and cd commands. However, some of the files or directories in that namespace may be special DAOS objects, and navigating into a DAOS-based object transparently switches the data path from one that uses the traditional Lustre client stack to one that uses the DAOS client stack. In particular,

Navigating into a directory that is backed by a DAOS container will cause the local DAOS agent to mount that DAOS container as a POSIX namespace using FUSE and junction it into the Lustre namespace. Files and subdirectories contained therein will behave as regular POSIX files and subdirectories for the most part, but they will only honor a subset of the POSIX consistency semantics.

Accessing a file that is backed by a DAOS container (such as an HDF5 file) will cause the client to access the contents of that object through whatever API and semantics the DAOS adapter for that container format provides.
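The switch between data paths described above can be pictured with a toy model. This is only a sketch of the idea and not DAOS or Lustre code: the `ForeignEntry` record, the `FOREIGN` table, the paths, and the container labels are all hypothetical names invented for illustration, and a real system would keep this information in Lustre layout metadata rather than in a Python dict.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ForeignEntry:
    """Hypothetical record for a namespace entry backed by DAOS."""
    container: str  # made-up container label
    kind: str       # "posix" for a FUSE-mounted directory, else a format name


# Hypothetical table of DAOS-backed entries in an otherwise ordinary namespace.
FOREIGN = {
    PurePosixPath("/lustre/proj/daos_dir"): ForeignEntry("cont-a", "posix"),
    PurePosixPath("/lustre/proj/data.h5"): ForeignEntry("cont-b", "hdf5"),
}


def data_path(path):
    """Return which client stack would serve `path` in this toy model."""
    p = PurePosixPath(path)
    # Check the path itself, then each ancestor directory up to the root.
    for candidate in (p, *p.parents):
        entry = FOREIGN.get(candidate)
        if entry is None:
            continue
        if entry.kind == "posix":
            # Anything at or below a DAOS-backed directory uses the FUSE mount.
            return f"DAOS via FUSE mount of {entry.container}"
        if candidate == p:
            # A DAOS-backed file is served by the adapter for its format.
            return f"DAOS via {entry.kind} adapter for {entry.container}"
    return "Lustre client"


print(data_path("/lustre/proj/plain.txt"))
print(data_path("/lustre/proj/daos_dir/sub/file"))
print(data_path("/lustre/proj/data.h5"))
```

The point of the model is that the user only ever sees paths; which stack services the I/O is decided by what backs the entry, not by anything the user types.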

DAOS also includes a preloadable library which allows performance-sensitive applications to bypass the FUSE client entirely and map POSIX API calls to DAOS native API calls. For applications that use middleware such as HDF5 or MPI-IO, I/O will be able to entirely bypass the POSIX emulation layer and get the highest performance through DAOS-optimized backends. In the most extreme cases, applications can also write directly against the DAOS native object API to control I/O with the finest granularity, or use one of DAOS’s addon APIs that encapsulate other non-file access methods such as key-value or array operations.

A significant amount of this functionality is already implemented, and Intel was showing DAOS performance demos at its booth that used both IOR (using the DAOS-native backend) and Apache Spark.

The test hardware was a single DAOS server with Intel Optane DIMMs and two Intel QLC NAND SSDs and demonstrated over 3 GB/sec on writes and over a million read IOPS on tiny (256-byte) transfers. Johann indicated that their testbed hardware is being scaled up dramatically to match their extremely aggressive development schedule, and I fully expect to see performance scaling results at SC this November.
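It helps to convert between IOPS and bandwidth to see that these two figures probe different limits. The sketch below uses the "over" lower bounds quoted above and decimal units; note that the transfer size used for the write test was not stated, so the second calculation is a hypothetical comparison rather than a description of the demo.

```python
def iops_to_mb_per_sec(iops, transfer_bytes):
    """Bandwidth (decimal MB/sec) implied by an IOPS rate at a fixed transfer size."""
    return iops * transfer_bytes / 1e6


def iops_needed(gb_per_sec, transfer_bytes):
    """Operations per second needed to sustain a bandwidth at a fixed transfer size."""
    return gb_per_sec * 1e9 / transfer_bytes


# One million 256-byte reads per second moves very little data.
read_mb_s = iops_to_mb_per_sec(1_000_000, 256)

# Hypothetical: reaching 3 GB/sec with 256-byte transfers alone.
ops_for_3gb_s = iops_needed(3, 256)

print(f"1M IOPS at 256 B = {read_mb_s:.0f} MB/sec")
print(f"3 GB/sec at 256 B would take {ops_for_3gb_s:,.0f} IOPS")
```

A million 256-byte reads per second amounts to only 256 MB/sec, so the IOPS result says more about per-operation overhead than about raw throughput, while the write figure reflects streaming bandwidth.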

This is all a far cry from the original Fast Forward DAOS, and this demo and discussion on the show floor was the first time I felt confident that DAOS was not only a good idea but also a solution that can realistically move HPC beyond the parallel file system. Its POSIX compatibility features and Lustre namespace integration provide enough familiarity and interoperability to make it something usable for the advanced HPC users who will be using the first exascale machines.

At the same time, it applies a number of new technologies in satisfying ways (Mercury for user-space network transport, GIGA+ for subtree sharding, Optane to coalesce tiny I/Os, …) that, in most respects, put it at technological parity with other high-performance all-flash parallel storage systems like WekaIO and VAST. It is also resourced at similar levels, with DOE and Intel investing money and people in DAOS at levels comparable to the venture capital that has funded the aforementioned competitors. Unlike its competitors though, it is completely open-source and relies on standard interfaces into hardware (libfabric, SPDK), which gives it significant flexibility in deployment.

As with everything exascale, only time will tell how DAOS works in practice. There are plenty of considerations peripheral to performance (data management policies, system administration, and the like) that will also factor into the overall viability of DAOS as a production, high-performance storage system. But so far DAOS seems to have made incredible progress in the last few years, and it is positioned to shake up the HPC I/O discussion come 2021.

The Cloud is coming for us

This ISC also marked the first time I felt that the major cloud providers were converging on a complete HPC solution that could begin eroding campus-level and mid-range HPC. Although application performance in the cloud has historically been the focus of most HPC-vs-cloud debate, compute performance is largely a solved problem in the general sense. Rather, data (its accessibility, performance, and manageability) has been the single largest barrier between most mid-range HPC users and the cloud. The convenience of a high-capacity and persistent shared namespace is a requirement in all HPC environments, but there have historically been no painless ways to produce this environment in the cloud.

AWS was the first to the table with a solution in Amazon FSx, which is a managed Lustre-as-a-service that makes it much easier to orchestrate an HPC workflow that relies on a high-performance, high-capacity, shared file system. This has prompted the other two cloud vendors to come up with competing solutions: Microsoft Azure’s partnership with Cray is resulting in a ClusterStor Lustre appliance in the cloud, and Google Cloud will be offering DDN’s EXAScaler Lustre appliances as a service. And Whamcloud, the company behind Lustre, offers its own Lustre Cloud Edition on all three major cloud platforms.

In addition to the big three finally closing this gap, a startup called Kmesh burst onto the I/O scene at ISC this year and is offering a cloud-agnostic solution to providing higher-touch parallel file system integration and management in the cloud for HPC. Vinay Gaonkar, VP of Products at Kmesh, gave insightful presentations at several big I/O events during the week that spoke to the unique challenges of designing Lustre file systems in a cloud ecosystem. While architects of on-prem storage for HPC are used to optimizing for price-performance on the basis of purchasing assets, optimizing price-performance from ephemeral instance types often defies conventional wisdom; he showed that instance types that may be considered slow on a computational basis may deliver peak I/O performance at a lower cost than the beefiest instance available.
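The shape of that price-performance argument can be sketched in a few lines. The instance names, bandwidths, and prices below are invented placeholders for illustration only; they are not Kmesh’s measurements or any provider’s real figures, and the model assumes bandwidth scales linearly with instance count, which a real Lustre deployment may not achieve.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str             # hypothetical label, not a real cloud SKU
    io_gb_per_sec: float  # placeholder sustained I/O bandwidth per instance
    usd_per_hour: float   # placeholder hourly price


def cheapest_for_target(instances, target_gb_per_sec):
    """Return (name, count, hourly cost) of the cheapest way to hit a bandwidth target."""
    best = None
    for inst in instances:
        # Assume aggregate bandwidth scales linearly with instance count.
        count = math.ceil(target_gb_per_sec / inst.io_gb_per_sec)
        cost = count * inst.usd_per_hour
        if best is None or cost < best[2]:
            best = (inst.name, count, cost)
    return best


# PLACEHOLDER numbers chosen only to illustrate the effect.
CANDIDATES = [
    InstanceType("small-compute", io_gb_per_sec=1.0, usd_per_hour=0.50),
    InstanceType("big-compute", io_gb_per_sec=2.0, usd_per_hour=3.00),
]

choice = cheapest_for_target(CANDIDATES, target_gb_per_sec=10.0)
print(choice)
```

With these made-up inputs, ten of the slower instances reach the target at a third of the hourly cost of five of the larger ones, which is the kind of counterintuitive outcome the talk described.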

Vinay’s slides are available online and offer a great set of performance data for high-performance storage in the public clouds.

The fact that there is now sufficient market opportunity to drive these issues to the forefront of I/O discussion at ISC is an indicator that the cloud is becoming increasingly attractive to users who need more than simple high-throughput computing resources.

Even with these sorts of parallel-file-system-as-a-service offerings, though, there are still non-trivial data management challenges when moving on-premise HPC workloads into the cloud that result from the impedance mismatch between scientific workflows and the ephemeral workloads for which cloud infrastructure is generally designed. At present, the cost of keeping active datasets on a persistent parallel file system in the cloud is prohibitive, so data must continually be staged between an ephemeral file-based working space and long-term object storage. This is approximately analogous to moving datasets to tape after each step of a workflow, which is unduly burdensome to the majority of mid-scale HPC users.

However, such staging and data management issues are no longer unique to the cloud; as I will discuss in the next section, executing workflows across multiple storage tiers is now a concern well beyond the biggest HPC centers. The solutions that address the burdens of data orchestration for on-premise HPC are likely to also ease the burden of moving modest-scale HPC workflows entirely into the cloud.

Tiering is no longer only a problem of the rich and famous

Intel started shipping Optane persistent memory DIMMs earlier this year, and the rubber is now hitting the road as far as figuring out what I/O problems it can solve at the extreme cutting edge of HPC. At the other end of the spectrum, flash prices have now reached a point where meat-and-potatoes HPC can afford to buy it in quantities that can be aggregated into a useful tier. These two factors resulted in a number of practical discussions about how tiering can be delivered to the masses in a way that balances performance with practicality.

The SAGE2 project featured prominently at the high end of this discussion. Sai Narasimhamurthy from Seagate presented the Mero software stack, which is the Seagate object store that is being developed to leverage persistent memory along with other storage media. At a distance, its goals are similar to those of the original DAOS in that it provides an integrated system that manages data down to a disk tier. Unlike the DAOS of today though, it takes on the much more ambitious goal of providing a PGAS-style memory access model into persistent storage.

On the other end of the spectrum, a number of new Lustre features are rapidly coalescing into the foundation for a capable, tiered storage system. At the Lustre/EOFS BOF, erasure coded files were shown on the roadmap for the Lustre 2.14 release in 2Q2020. While the performance of erasure coding probably makes it prohibitive as the default option for new files on a Lustre file system, erasure coding in conjunction with Lustre’s file-level replication will allow a Lustre file system to store, for example, hot data in an all-flash pool that uses striped mirrors to enable high IOPS and then tier down cooler data to a more cost-effective disk-based pool of erasure-coded files.
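The capacity side of that trade-off is easy to quantify. The sketch below compares the raw capacity consumed per usable byte under mirroring and under erasure coding; the 8+2 geometry and the 1,000 TB dataset are assumptions picked for illustration, not parameters taken from the Lustre roadmap.

```python
def mirror_overhead(copies):
    """Raw bytes stored per usable byte with N-way mirroring."""
    return float(copies)


def erasure_overhead(data_units, parity_units):
    """Raw bytes stored per usable byte with a data+parity erasure code."""
    return (data_units + parity_units) / data_units


def raw_capacity_needed(usable_tb, overhead):
    """Raw capacity (TB) required to hold a given usable capacity."""
    return usable_tb * overhead


USABLE_TB = 1000  # hypothetical dataset size

hot_tier_tb = raw_capacity_needed(USABLE_TB, mirror_overhead(2))
cool_tier_tb = raw_capacity_needed(USABLE_TB, erasure_overhead(8, 2))  # assumed 8+2

print(f"2-way mirror: {hot_tier_tb:.0f} TB raw")
print(f"8+2 erasure coding: {cool_tier_tb:.0f} TB raw")
```

Under these assumptions mirroring doubles the raw footprint while 8+2 erasure coding adds only 25%, which is why paying the erasure coding performance penalty makes sense for cooler data on cheaper media.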

In a similar vein, Andreas Dilger also discussed future prospects for Lustre at the HPC I/O in the Data Center workshop and showed a long-term vision for Lustre that is able to interact with both tiers within a data center and tiers across data centers.

Many of these features already exist and serve as robust building blocks from which a powerful tiering engine could be crafted.

Finally, tiering took center stage at the Virtual Institute for I/O and IO-500 BOF at ISC with the Data Accelerator at Cambridge beating out OLCF Summit as the new #1 system. A key aspect of Data Accelerator’s top score arose from the fact that it is an ephemeral burst buffer system; like Cray DataWarp, it dynamically provisions parallel file systems for short-term use. As a result of this ephemeral nature, it could be provisioned with no parity protection and deliver a staggering amount of IOPS.

Impressions of the industry

As I’ve described before, I often learn the most by speaking one-on-one with engineers on the expo floor. I had a few substantive discussions and caught on to a few interesting trends.

No winners in EDSFF vs. NF.1

It’s been over a year since Samsung’s NF.1 (formerly M.3 and NGSFF) and Intel’s EDSFF (ruler) SSD form factors were introduced, and most integrators and third-party SSD manufacturers remain completely uncommitted to building hardware around one or the other. Both form factors have their pros and cons, but the stalemate persists by all accounts so far. Whatever happens to break this tie, it is unlikely that it will involve the HPC market, and it seems like U.2 and M.2 remain the safest bet for the future.

Memory landscape and competition

The HBM standard has put HMC (hybrid memory cube) in the ground, and I learned that Micron is committed to manufacturing HBM starting at the 2e generation. Given that SK Hynix is also now manufacturing HBM, Samsung may start to face competition in the HBM market as production ramps up. Ideally this brings down the cost of HBM components in the coming years, but the ramp seems to be slow, and Samsung continues to dominate the market.

Perhaps more interestingly, 3DXPoint may be diversifying soon. Although the split between Intel and Micron has been well publicized, I failed to realize that Intel will also have to start manufacturing 3DXPoint in its own fabs rather than the shared facility in Utah. Micron has also announced its commitment to the NVDIMM-P standard, which could feasibly blow the doors open on persistent memory by allowing non-Intel processor vendors to support it. However, Micron has not committed to an explicit combination of 3DXPoint and NVDIMM-P.

Realistically, the proliferation of persistent memory based on 3DXPoint may be very slow. I hadn’t realized it, but not all Cascade Lake Xeons can even support Optane DIMMs; there are separate SKUs with the requisite memory controller, suggesting that persistent memory won’t be ubiquitous, even across the Intel portfolio, until the next generation of Xeon at minimum. Relatedly, none of the other promising persistent memory technology companies (Crossbar, Everspin, Nantero) had a presence at ISC.

China

The US tariffs on Chinese goods are on a lot of manufacturers’ minds. Multiple vendors remarked that they are either

thinking about moving more manufacturing from China into Taiwan or North America,

already migrating manufacturing out of China into Taiwan or North America, or

under pressure to make shorter-term changes to their supply chains (such as stockpiling in the US) in anticipation of deteriorating conditions.

I was not expecting to have this conversation with as many big companies as I did, but it was hard to avoid.

Beyond worrying about the country of origin for their components, though, none of the vendors with whom I spoke were very concerned about competition from the burgeoning Chinese HPC industry. Several commented that even though some of the major Chinese integrators have very solid packaging, they are not well positioned as solutions providers. At the same time, customers are now requiring longer presales engagements due to the wide variety of new technologies on the market. As a result, North American companies playing in the HPC vertical are finding themselves transitioning into higher-touch sales, complex custom engineering, and long-term customer partnerships.

Concluding thoughts

This year’s ISC was largely one of anticipation of things to come rather than demonstrations that the future has arrived. Exascale (and the pre-exascale road leading to it) dominated most of the discussion during the week. Much of the biggest hype surrounding exascale has settled down, and gone are the days of pundits claiming that the sky will fall when exascale arrives due to constant failures, impossible programming models, and impossible technologies. Instead, exascale is beginning to look very achievable and not unduly burdensome: we know how to program GPUs and manycore CPUs already, and POSIX file-based access will remain available for everyone. The challenges are similar to what they’ve always been: continuing to push the limits of scalability in every part of the HPC stack.

I owe my sincerest thanks to the organizers of ISC, its sessions, and the HPC-IODC workshop for putting together the programs that spurred all of the interesting discourse over the week. I also appreciate the technical staff at many of the vendor booths with whom I spoke. I didn’t name every person from whom I drew insights on the expo floor, but if you recognize a comment that you made to me in this post and want credit, please do let me know; I’d be more than happy to. I also apologize to all the people with whom I spoke and the sessions I attended but did not include here; not everything I learned last week fit here.