Author’s note: This article introduces readers to modern data transfer technologies that can increase data transfer speed inside applications and operating systems. Many technical details of how these mechanisms are implemented are left out of the article on purpose. The author recommends learning about the Linux graphics stack and general DRAM memory technology from the original documentation and source code.

Preface

As the volume of processed data grows, so do the requirements for server hardware. Modern Big Data systems, analytical databases and cluster applications can process tens of gigabytes per second, have hundreds of terabytes of memory and operate on tens of petabytes of data. Proper optimization of such systems can save their owners tens or even hundreds of millions of dollars.

Zero-copy and RDMA Introduction

Today we will talk about Zero-copy and RDMA, technologies that can significantly accelerate almost any database or cluster application and are used by most Big Data processing systems.

The essence of Zero-copy is to minimize buffer copy operations by sharing a common memory buffer between applications, between applications and hardware (as in kernel-bypass technologies), or between the kernel and the hardware. The article also covers application-level optimization that minimizes pointless buffer copies between data structures; as a result, fewer working buffers are needed for the task, which frees memory for other uses.

A classical case for Zero-copy is data transfer between two applications inside one physical computer.
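As a small illustration of avoiding pointless copies inside a single application, the sketch below splits a buffer into two parts with Python's built-in `memoryview` instead of slicing it. The function name `header_and_body` and the sample packet are hypothetical and chosen only for this example.

```python
def header_and_body(packet: bytearray, header_len: int):
    """Split a packet into header and body without copying either part."""
    view = memoryview(packet)
    return view[:header_len], view[header_len:]


packet = bytearray(b"HDR:payload bytes")
header, body = header_and_body(packet, 4)

# Both views alias the original buffer: a write through a view changes the packet.
body[0:7] = b"PAYLOAD"
print(bytes(header), bytes(packet))

# By contrast, slicing the bytearray itself allocates a new object and copies data.
copied = packet[4:]
copied[0:1] = b"x"
print(bytes(packet[4:5]))  # the original is untouched by changes to the copy
```

The views cost no extra buffer memory, while every plain slice allocates and fills a new buffer.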

Assume that one application needs to transfer some data to another. In the classical case, the two programs open a connection through an API and get two file descriptors, one per program (it does not matter whether a local network connection, a pipe or a UNIX socket is used); after that, the main data transfer begins. This process is usually rather slow and consists of the following stages:

The kernel allocates buffers of a certain size to receive and transmit data (so if the transferred message is larger than this buffer, each program has to read and write data several times).

Application-1 puts data into the buffer, making a system call to the kernel each time.

The kernel copies the buffer into kernel space (although some operating systems have built-in optimizations for this step) and processes it (this processing can be rather costly in the case of the TCP/IP stack).

The kernel copies the buffer into the memory area of application-2, then initiates a context switch to that application.

As a result, transferring 10 MB of data through a 512-byte buffer may require well over 8,000 (!) context switches (10 MB / 512 B is about 20,000 write calls alone), not to mention possible additional processing at the kernel level, which also requires CPU time. The situation is made worse by the fact that each switch reloads processor registers and tends to displace cached data (the L1 cache in particular), which reduces processor performance.
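The stages above can be sketched with a local socket pair. This is a minimal illustration, not a benchmark: the helper name `chunked_transfer` is hypothetical, the receiver runs in a thread rather than a separate process, and the counter tracks `sendall` calls, each of which costs at least one system call.

```python
import socket
import threading


def chunked_transfer(payload: bytes, chunk_size: int):
    """Send payload over a local socket pair in chunk_size pieces.

    Returns (received_bytes, send_calls).
    """
    tx, rx = socket.socketpair()
    received = bytearray()

    def reader():
        # "Application-2": read until the sender closes its end.
        while True:
            data = rx.recv(chunk_size)
            if not data:
                break
            received.extend(data)

    worker = threading.Thread(target=reader)
    worker.start()

    send_calls = 0
    with tx:  # closing tx signals end-of-stream to the reader
        for offset in range(0, len(payload), chunk_size):
            tx.sendall(payload[offset:offset + chunk_size])
            send_calls += 1
    worker.join()
    rx.close()
    return bytes(received), send_calls


data = bytes(10 * 1024 * 1024)   # 10 MiB of zeros
echoed, calls = chunked_transfer(data, 512)
print(calls)                     # 20480
```

Raising `chunk_size` lowers the call count proportionally, which is the partial fix discussed in the next section.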

Buffer size and shared memory in data transfer

Using a very large buffer can partially solve this, but because the buffer is allocated in at least three places (the kernel, the source program and the recipient program), this practice can lead to a RAM shortage (especially if data must be transferred between more than two applications). In addition, copying memory costs CPU time and time spent waiting on RAM (since such buffers will not fit in the L2 or L3 caches), so maximum optimization is not achieved.

A better approach is to transfer data using shared memory as the buffer between applications.

In this case data transfer is as follows:

Both applications ask the kernel to create a shared region of physical RAM. The kernel configures the Memory Management Unit so that when virtual memory at the buffer location is accessed, both applications reach the same physical RAM pages.

The applications set up a dedicated communication channel (IPC, a UNIX socket, or even a network socket).

After that, the transfer itself takes place, and it is now much simpler and faster: application-1 sends data to application-2 just by putting it into the buffer and calling the kernel, which notifies application-2.

The kernel switches context to itself and then to application-2, notifying it about new data in the buffer. Application-2 reads the buffer.

In this case three context switches are enough to transfer 10 MB of data, using a shared buffer of 10 MB. Moreover, if the CPU cache is large enough to hold this buffer along with the relevant parts of the OS and application code, the applications can transfer data without waiting for RAM, since the data will be cached by the CPU. This can speed up real application operations by more than a factor of ten.
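The shared-memory scheme above can be sketched with Python's `multiprocessing.shared_memory` module (Python 3.8 or later). This is a simplified illustration under stated assumptions: both "applications" are simulated inside one process, the function name `transfer_via_shared_memory` is hypothetical, and a socket pair stands in for the notification channel.

```python
import socket
from multiprocessing import shared_memory


def transfer_via_shared_memory(payload: bytes) -> bytes:
    """Pass payload from 'application-1' to 'application-2' through one shared buffer."""
    # Application-1 asks the kernel for a shared region and writes into it.
    producer = shared_memory.SharedMemory(create=True, size=len(payload))
    # A small side channel carries only the notification, never the data itself.
    notify_tx, notify_rx = socket.socketpair()
    try:
        producer.buf[:len(payload)] = payload
        notify_tx.sendall(len(payload).to_bytes(8, "little"))

        # Application-2 is notified, attaches to the same region by name and reads it.
        size = int.from_bytes(notify_rx.recv(8), "little")
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            # bytes() copies here only so the result outlives the mapping.
            return bytes(consumer.buf[:size])
        finally:
            consumer.close()
    finally:
        notify_tx.close()
        notify_rx.close()
        producer.close()
        producer.unlink()


message = b"frame data " * 1000
print(transfer_via_shared_memory(message) == message)  # True
```

Only eight bytes travel through the socket regardless of the payload size; in a real two-process setup the second process would receive the region name over the same channel.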

Practical example

Many UNIX-like operating systems use the X11 protocol for graphics. The current version of the graphics subsystem has many improvements related to Zero-copy.

So, let’s examine the following simplified example: a 4K frame (3840x2160, 24 bits per pixel) must be delivered from a YouTube window to the frame buffer of the video card.

This operation proceeds as follows (including kernel and connection setup):

1. The browser transfers the image buffer data to the Xorg server (via a UNIX socket whose buffer is smaller than the transferred image).

2. The Xorg server transfers the data to the composite manager (via a UNIX socket whose buffer is smaller than the transferred image).

3. The composite manager processes the image and may also apply a window frame, transparency, and overlays of other windows. After that, the generated buffer is transferred back to the Xorg server (via a UNIX socket whose buffer is smaller than the transferred image).

4. The Xorg server transfers the frame image into a kernel buffer, where the image for the video card frame buffer is formed (via a UNIX socket whose buffer is smaller than the transferred image).

5. The kernel transfers the frame image from RAM into the video card frame buffer (calling the video card function that copies RAM data to the shadow buffer).

6. The kernel waits for the copy into the video card's shadow buffer to complete (signalled by an interrupt from the video card).

7. The kernel sends a command to switch the primary and shadow buffers.

8. The video card displays the frame on the screen.

As the sequence above shows, data passes through UNIX sockets between several applications many times, and every pass involves context switches through the kernel. Considering that UNIX sockets usually operate with a maximum of 64 KB per buffer, transferring one frame of 4K video (around 23.7 MB) takes about 380 buffer-sized transfers and 1,140 CPU context switches. Counting the switches for the first four steps gives at least 4,560. The composite manager also spends many CPU cycles on graphical operations with windows (overlaying, framing, etc.).
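The figures in this estimate can be reproduced with a few lines of arithmetic. The sketch assumes three context switches per buffer-sized transfer, a value inferred from the article's own numbers (1,140 / 380 = 3) rather than measured; the names `SWITCHES_PER_CHUNK` and `context_switches` are hypothetical.

```python
SWITCHES_PER_CHUNK = 3  # assumption inferred from the article's figures, not measured


def context_switches(total_bytes: int, buffer_bytes: int, hops: int = 1) -> int:
    """Estimate context switches for moving total_bytes through hops socket transfers."""
    chunks = (total_bytes + buffer_bytes - 1) // buffer_bytes  # ceiling division
    return chunks * SWITCHES_PER_CHUNK * hops


frame_bytes = 3840 * 2160 * 3            # 24-bit colour is 3 bytes per pixel
print(frame_bytes)                       # 24883200
print(round(frame_bytes / 2**20, 2))     # 23.73
print(context_switches(frame_bytes, 64 * 1024))          # 1140
print(context_switches(frame_bytes, 64 * 1024, hops=4))  # 4560
```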

Optimization of modern graphical stack in Linux systems

To reduce the load, modern systems apply a number of optimizations in practice:

1. The composite manager uses the GPU for most operations (overlaying, resizing, etc.). Output goes directly into the memory of the graphics card (using OpenGL and DRI), bypassing the X server. The buffers on the video card side and the DMA configuration on the kernel side are set up through DRI and the X server. After that, the composite manager sends window buffers to the video card’s buffer with a single system call, and the kernel instructs the graphics card to read from the required region of RAM.

2. Many applications also output the image buffers of their windows directly into the memory of the graphics card (using isolated virtual main and back (shadow) buffers, plus a call to the kernel to start copying from RAM to the GPU), and notify the composite manager about changes (using OpenGL and DRI).

3. The composite manager uses OpenGL and DRI to overlay the isolated application buffers on one another, using shaders executed on the GPU.
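The main and back (shadow) buffer arrangement mentioned in the list above can be modelled in a few lines. This is a plain-Python model of the idea only, not the DRI or OpenGL API; the class `DoubleBuffer` and its methods are hypothetical names.

```python
class DoubleBuffer:
    """Model of a front (displayed) and back (being drawn) buffer pair."""

    def __init__(self, size: int):
        self.front = bytearray(size)  # what the screen currently shows
        self.back = bytearray(size)   # where the next frame is drawn

    def draw(self, pixels: bytes) -> None:
        # Rendering always targets the back buffer, never the visible one.
        self.back[:len(pixels)] = pixels

    def swap(self) -> None:
        # Presenting a frame exchanges two references; no pixel data is copied.
        self.front, self.back = self.back, self.front


screen = DoubleBuffer(4)
drawn_into = screen.back
screen.draw(b"\xff\xff\xff\xff")
screen.swap()
print(screen.front is drawn_into)  # True
```

The swap costs the same no matter how large the frame is, which is why the buffer switch in the kernel is cheap compared with copying the frame.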

Conclusion

In this way, the number of context switches can be reduced from more than a thousand to a few dozen, and image processing time can be cut using hardware acceleration. Processor load and RAM usage are reduced as well.

Many operations (shared memory between applications, between applications and devices, as well as between applications on different hosts) are accelerated in the same way.

In the next article we will examine RDMA technology and see how network delays, processor context switches, and the network stack itself can reduce performance a hundredfold.

At ANNA Systems we make maximum use of modern approaches to computation: in simulation systems for hydrodynamics and electromagnetic fields, in neural networks, and in other high-performance, demanding software. Our team also actively uses various technologies for reducing delays, from competent orchestration (preventing cache thrashing) and zero-copy to MPI-over-RDMA. This technology stack lets us use computing resources more efficiently and therefore deliver calculation results to our customers faster and at lower cost. In our work we strive to integrate business requirements as closely as possible with the final IT solution.

Author:

Simakin Vladimir Andreevich,

IT-Architect ANNA Systems