China’s massive Sunway TaihuLight supercomputer sent ripples through the computing world last year when it debuted in the number-one spot on the Top500 list of the world’s fastest supercomputers. Delivering 93,000 teraflops of sustained performance – and a peak of more than 125,000 teraflops – the system is nearly three times faster than the second supercomputer on the list (the Tianhe-2, also a Chinese system) and dwarfs the Titan system at Oak Ridge National Laboratory, a Cray-based machine that is the world’s third-fastest system and the fastest in the United States.

However, it wasn’t only the system’s performance that garnered attention. It was also the fact that the supercomputer was powered by Sunway’s many-core SW26010 processors – built in China – rather than chips from well-known US players like Intel, AMD or Nvidia. As we’ve talked about before, the TaihuLight system and the SW26010 chips it runs on are part of a larger push by Chinese officials to have more components for Chinese systems made in China rather than by US vendors, an effort fueled by a number of factors, from national security concerns to national competitive pride. Another part of that push is China’s plan to spend $150 billion over 10 years to build out the country’s chip-making capabilities.

The chip itself is not overly impressive by the numbers. Jack Dongarra of the University of Tennessee and Oak Ridge National Laboratory, in outlining the current state of the high-performance computing space and the challenges it faces, described the SW26010’s manufacturing process (28-nanometer technology) and clock speed (1.45GHz) as “modest” compared with what Intel, AMD and other vendors in the United States are coming out with. However, the supercomputer is powered by more than 10.6 million of those cores. By comparison, Tianhe-2 runs 3.12 million Intel Xeon E5-2692 cores.

The size and performance capabilities of the supercomputer, which is installed at the National Supercomputing Center in China, make it an attractive choice for running computationally intensive workloads like computational fluid dynamics (CFD), used to simulate phenomena in a broad range of scientific areas, including meteorology, aerodynamics and environmental sciences. A group of scientists from the Center for High Performance Computing at Shanghai Jiao Tong University in China and the Tokyo Institute of Technology in Japan recently released a paper outlining experiments they conducted running a hybrid implementation of the Open Source Field Operation and Manipulation (OpenFOAM) CFD application on the TaihuLight system. The researchers wanted to see if they could develop a hybrid implementation of the software to overcome a compiler incompatibility in the SW26010 processor. They called OpenFOAM one of the most popular CFD applications built on C++.

In their study, titled “Hybrid Implementation and Optimization of OpenFOAM on the SW26010 Many-core Processor,” the researchers laid out the challenge presented by the chip when running C++ programs.

“The processor includes four core groups (CGs), each of which consists of one management processing element (MPE) and sixty-four computing processing elements (CPEs) arranged by an eight by eight grid,” they wrote. “The basic compiler components on MPE support C/C++ programming language, while the compiler components on CPE only support C. The compilation incompatibility problem makes it difficult for C++ programs to exploit the computing power of the SW26010 processor.”

In order to get high performance from the OpenFOAM program while running on the chip, the researchers – Delong Meng, Minhua Wen, Jianwen Wei and James Lin – not only used a mixed-language design for the application, but also applied several SW26010-specific optimizations to the software. What they did with the OpenFOAM application can also be used with other complex C++ programs to ensure high performance when running on systems powered by the SW26010 processor.

Details of the study can be found here, but one of the key steps was developing a mixed-language programming model for OpenFOAM, in part by modifying the data storage format and reimplementing the kernel code in C. In addition, on the MPE, they devised a new compilation method for OpenFOAM in which they compile ThirdParty and OpenFOAM with GCC and swg++-4.5.3, respectively, and changed the linking mode of OpenFOAM to use static libraries. The optimizations on the MPE covered such areas as vectorization, data presorting and algorithm optimization.

They also took steps to run OpenFOAM on the chip’s CPE cluster, which only supports the C compiler, by using a master-slave cooperative algorithm for the preconditioned conjugate gradient (PCG) method and by modifying the library file. Optimizations on the CPE side covered such areas as data structure transformation, register communication, direct memory access (DMA), prefetching, double buffering and data reuse.

The study’s authors then tested the software by running it on both an SW26010 processor and a 2.3GHz Xeon E5-2695 v3 in a test case involving what they described as a “lid-driven cavity flow. The top boundary of the cube is a moving wall that moves in the x-direction, whereas the rest are static walls.” In the tests comparing the performance of the MPE, the CPE cluster and the Intel chip, they found that the optimized CPE cluster delivered an 8.03-times performance increase over the optimized implementation on the MPE. In addition, the CPE cluster was 1.18 times faster than the single-core Intel chip. However, while the CPE cluster performance was better than that of the Intel processor, there were issues with efficiency. Those stemmed from the SW26010’s smaller cache and scratchpad memory (SPM), which force data to be repeatedly loaded into the SPM and hinder memory access. In addition, the DMA latency was high, and the automatic optimizations applied by the SW26010’s compiler were less efficient than with the Intel chip.

However, the researchers said they proved that the work they did to enable OpenFOAM to reach high performance on the SW26010 can be applied to other C++ workloads.

“The implementation and results we present demonstrate how complex codes and algorithms can be efficiently implemented on such diverse architectures as hybrid MPE-CPEs systems,” they wrote. “We can hide hardware-specific programming models into libraries and make them general purpose. OpenFOAM is now ready to effectively exploit the new supercomputing system based on the SW26010 processor.”