The Parallel GCC

This page introduces the Parallel GCC -- a research project aiming to parallelize a real-world compiler. This can be useful on many-core machines where GNU Make itself cannot provide enough parallelism, or in the future if someone wants to design a parallel compiler from scratch.

In this page, we document and discuss how to use this project, the theoretical background that motivates it, the challenges faced so far, what was done and which decisions were made to fix some problems, the architecture developed to control parallelism, the results so far, and what is left to do.

Please keep in mind that this project is under development and still has several bugs.

Downloading and Building the Project

Clone the repository, check out the giulianob_parallel branch, create a new build directory, navigate into it, and then run configure. For example:

$ git clone https://gitlab.com/flusp/gcc.git
$ cd gcc
$ git checkout giulianob_parallel
$ mkdir build && cd build
$ ../configure --disable-bootstrap --enable-languages=c --disable-multilib
$ make

The only additional dependency required by this project is POSIX threads, which is available on most Unix-like systems.

Using the Parallel GCC

After you install GCC, use:

$ gcc --param=num-threads=4 <PARAMS>

This will make GCC spawn 4 threads to compile the code.

Theoretical Background

Inter and Intra Procedural Optimizations

GCC is an optimizing compiler, which means that it automatically optimizes your code when compiling. GCC splits optimizations into two categories, defined as follows:

Intra Procedural Optimizations are optimizations applied inside a single function, ignoring its caller and callee relationships. They therefore use no information about how the function interacts with other functions. One example is the vectorizer.

Inter Procedural Optimizations are optimizations which require information about how a function interacts with other functions. One example is inlining.

From this definition, it follows that Intra Procedural Optimizations on two or more functions can be performed in parallel.

GCC's optimization phase is split into three steps:

Inter Procedural Analysis (IPA): Builds a callgraph and uses it to decide how to perform optimizations.

GIMPLE Intra Procedural Optimizations: Performs several hardware-independent optimizations inside the function.

RTL Intra Procedural Optimizations: Performs several hardware-dependent optimizations inside the function.

The pipeline works as follows: once IPA has collected information and decided how to optimize all functions, each function is sent to the GIMPLE optimizer, then to the RTL optimizer, and the final code is generated. This process repeats for every function in the code. The pseudocode below illustrates the process:

void
expand_all_functions ()
{
  graph* g = build_callgraph ();
  ipa_perform_analysis (g);
  function* cfun;

  FOR_EACH_FUNCTION (g, cfun)
    {
      cfun->expand_ipa ();
      cfun->expand_gimple ();
      cfun->expand_rtl ();
    }
}

You can check this part in cgraphunit.c, where expand_all_functions is implemented.

We started this project by parallelizing GIMPLE, as it is hardware-independent and therefore its parallelization benefits all architectures supported by GCC.

Parallel Architecture

We designed the following architecture to increase parallelism and reduce overhead. When IPA finishes its analysis, a number of threads equal to the number of logical processors is spawned, to avoid scheduling overhead. One of those threads then inserts all analyzed functions into a threadsafe producer-consumer queue, which all threads consume from. Once a thread has finished processing one function, it pops the next available function from the queue, until it finds an EMPTY token. At that point the thread finalizes, as there are no more functions to be processed.

This architecture is used to parallelize the per-function GIMPLE Intra Procedural Optimizations and can easily be extended to also support the RTL Intra Procedural Optimizations. It does not, however, cover the IPA passes or the per-language Front End analysis.

The pseudocode below illustrates the current state of this architecture:

void
expand_all_functions ()
{
  graph* g = build_callgraph ();
  ipa_perform_analysis (g);
  function* cfun;
  working_set ws;

  FOR_EACH_FUNCTION (g, cfun)
    {
      cfun->expand_ipa ();
    }

  ws.spawn_threads (expand_gimple);

  FOR_EACH_FUNCTION (g, cfun)
    {
      ws.insert_work (cfun);
    }

  ws.join ();

  FOR_EACH_FUNCTION (g, cfun)
    {
      cfun->expand_rtl ();
    }
}

Code Refactoring

Several parts of GCC's middle-end code were refactored in this project, and there are still many places where refactoring is necessary for the project to succeed.

The first changes concerned how functions are optimized in the pipeline. The original code required a single function to be optimized and output from GIMPLE to RTL, with no possibility of changing which function is being compiled partway through. Several structures in GCC were made per-thread or threadsafe, either by replicating them with the C11 thread-local notation, by allocating the data structure on the thread's stack, or simply by inserting locks.

One of the most tedious parts of the job was detecting the many global variables and making them threadsafe; they were the cause of most crashes in this project. Tools for detecting data races, such as Helgrind and DRD, were useful in the beginning but showed their limitations as the project advanced: several race conditions had a small window and did not occur when the compiler ran inside these tools. There is therefore a need for better tools to help find global variables and race conditions. Finding these variables through static analysis of the entire code base may be a good addition to the current tools.

In the subsections below we discuss some data structures which we found not to be easily replicated.

Memory Pools

Memory pools are data structures which allocate several objects of the same type in chunks, avoiding repeated calls to malloc() and ensuring that data is always aligned. This serves both as an optimization and as a way to avoid memory leaks, since the entire pool can be freed at once.

Memory pools were implemented in GCC as a class whose instances all point to one singleton Memory Allocator object, which carries out the memory allocation. This caused a serious race condition when threads tried to allocate and deallocate memory pools: one thread could release a pool which other threads still held pointers to, resulting in references to invalid memory, as well as typical race conditions on the structure's counters, which must be incremented and decremented as chunks of objects are allocated and released.

Since the data structure is required by other threads later in the compilation, which is still carried out by a single thread in the current state of this project, our first attempt was to implement a threadsafe Memory Pool allocator, which locks a mutex each time memory is allocated or released and annotates the thread ID on each chunk. When memory is released, a thread therefore only releases the chunks it currently owns. This approach made compilation slow, and the GCC tests failed due to timeouts, so another strategy was designed.

The second approach was to use distributed memory pools: each thread holds its own pool, so no locking is needed when allocating and releasing chunks. This also guarantees that one thread does not release the contents of another, since threads have no access to pools that belong to other threads. However, this raises an issue, as the data is required by another thread later in the compilation. The solution was to implement a pool merge feature, which merges two memory pools upon request. Since memory pools are implemented as linked lists, the merge could be done in O(1), although the current implementation requires O(n): the memory pool currently uses a linked list that tracks only its head, and it needs to be refactored to also track its tail.

All memory pools touched by the GIMPLE Intra Procedural Optimizations, except one, were refactored with this approach, and the merge feature was used only in those memory pools which required it. The one pool which was not refactored this way is the Euler Transversal Forest data structure (et-forest.c), simply because the compiler crashes when the strategy is employed there. The reason for this still has to be investigated.

Garbage Collection

GCC has an internal garbage collector that manages objects declared with the GTY(()) annotation. We cannot simply use the C11 thread-local notation for these objects, as it is not supported by the Garbage Collector. Currently, our approach is either to insert locks around these variables or to move them into the struct function object.

Currently, we have inserted a global lock in the Garbage Collector to ensure that memory allocation is serialized, and we have disabled memory collection when the program is running in multi-threaded mode. This will no longer be necessary once multithreading is supported by the Garbage Collector.

Memory Address to Symbol Conversion

In tree-ssa-address.c, there is a vector used to convert memory references to symbols, to detect whether an address is part of a symbol (e.g. a reference to an array element), and vice versa. This array is marked to be watched by the garbage collector, so we lock the structure every time the array is accessed. Research is needed to evaluate how much this lock impacts performance, and whether there is a better way of handling this situation.

Integer to Tree Node Hash

In tree.c, there is a hash table used to avoid reconstructing tree nodes that represent integer constants. This hash is also marked to be watched by the Garbage Collector, so we take a simple lock on the structure every time the hash is accessed. This approach may not be the best if the cost of locking and hashing becomes greater than that of recreating the tree node, so research is also needed here.

The rtl_data Structure

GCC uses a single instance of the rtl_data class, representing the current function being compiled in RTL. So far this should not be a problem, as the RTL expansion and optimization phase is still single-threaded. However, there are GIMPLE passes which compute instruction costs in RTL mode to decide how the function will be optimized. These passes access the rtl_data singleton and therefore expose a race condition that needs to be solved. To fix this issue, we must either replicate this structure, which is necessary anyway to parallelize the Intra Procedural RTL optimizations, or fix the GIMPLE passes so that they do not depend on instruction costs.

Wide Int to Tree Conversion

When creating integer constants through the wide_int_to_tree function, there is a race condition that makes TYPE_CACHED_VALUES (TREE_CHECK (type)) return NULL. The reason behind this race condition must still be investigated, but the most plausible cause is a race in how passes apply transformations to trees. The current workaround is to wait for the tree to become valid again by re-checking whether the macro's return value is NULL; however, this does not ensure correctness.

Results

Here we present our current performance results from parallelizing the GIMPLE Intra Procedural Optimizations. It must be highlighted that we are still facing race conditions, and that there are locks which can still be removed by duplicating the corresponding data structures.

Here we compile the file gimple-match.c, which is the biggest file in the GCC project. It has more than 100,000 lines of code, around 1700 functions, and almost no loops inside these functions. The computer used in this benchmark had an Intel(R) Core(TM) i5-8250U CPU with 8 GB of RAM: a CPU with 4 cores and Hyperthreading, resulting in 8 virtual cores. All points are the mean of 30 samples, and the confidence interval for the population mean was suppressed, as the standard deviation was fairly low.

The figure below shows our results before and after the Intra Procedural GIMPLE parallelization. We can observe that the time elapsed in this part dropped from 7 seconds to around 4 seconds with 2 threads and around 3 seconds with 4 threads, resulting in speedups of 1.72x and 2.52x, respectively. We can also see that Hyperthreading did not impact the result. This result was used to estimate the improvement achievable by RTL parallelization.

gimple_parallel.svg

The next figure shows these results compared with the total compilation time. Here we can see a small improvement of about 10% when compiling this file.

real.svg

However, since the same approach can be used to parallelize RTL, we can use the speedup obtained in GIMPLE to estimate a speedup of 1.61x for GCC once RTL is also parallelized. The next figure shows this estimate.

gcc_estimate.svg

TODOs