Automatic parallelization of programs in general, and of the loops they contain in particular, is one of the biggest challenges faced by the computing industry today. Many applications are written in ways that prevent automatic parallelization by compilers, so opportunities for optimization are often overlooked. While there has been progress over the years toward making parallelization more general, obstacles remain that keep it from being applicable and successful in all cases.
Having recently written a literature review on the subject, I am including part of my writeup here. It outlines notable previous work and the current state-of-the-art research on automatic loop parallelization, which together facilitate the continued development of parallel programs.
What’s holding us back?
The major impediment to automating loop parallelization is the set of data dependencies that arise between loop iterations on the elements they access. This impediment is therefore discussed first, before exploring the various categories of relevant research, with an emphasis on how each deals with it and how each extends the scope of automatic parallelization, along with other details of significance. Although not an exhaustive list, care has been taken to include the significant areas of research that portray different approaches to automatic loop parallelization, which include:
- Speculative execution (with subsections based on thread level speculation and polyhedral dependence analysis),
- Subscripted subscript pattern analysis,
- Compiler guided refactoring with a feedback mechanism,
- Automatically scalable computation and
- Static compiler parallelization.
Data dependence profiling is also briefly covered towards the end.
Data dependence analysis
Data dependencies carry their literal meaning here with respect to loop iterations: they are the inter-iteration (loop-carried) dependencies on the elements accessed within the loop. They come in three forms.
First, there is the read-after-write case, which arises when a location in memory is written to before it is read. This is what we commonly call a 'true data dependency', since the current loop iteration requires data produced by previous iterations and is hence dependent on one or more earlier iterations.
Then there is the write-after-read case, which arises when a location in memory is read before that same location is written to, i.e. the current loop iteration must use data before it is updated by a later iteration (an anti-dependency). The instruction that writes to the memory location has to wait until that location has been read by the earlier instruction, or else the reading instruction would read the wrong value.
Finally, there is the write-after-write case (output dependency), wherein a location in memory is written to before that same location is written to again by another statement, i.e. different iterations (or statements) write to the same memory address. The second instruction to write to the location must wait until the first instruction finishes its write, or else a later read of that location will see the wrong value.
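To make these three forms concrete, here is a minimal C illustration (my own simplified examples, not taken from any of the cited work):

```c
/* Minimal illustration of the three dependence forms inside loops. */

void raw_example(double *a, int n) {
    /* Read after write (true/flow dependence): iteration i reads a[i-1],
     * which was written by iteration i-1, so iterations cannot run freely
     * in parallel. */
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}

void war_example(double *a, int n) {
    /* Write after read (anti-dependence): iteration i must read a[i+1]
     * before iteration i+1 overwrites it. */
    for (int i = 0; i < n - 1; i++)
        a[i] = a[i + 1] * 2.0;
}

void waw_example(double *a, const int *idx, int n) {
    /* Write after write (output dependence): if idx contains duplicate
     * values, two different iterations write the same element of a, so the
     * final value depends on iteration order. */
    for (int i = 0; i < n; i++)
        a[idx[i]] = (double)i;
}
```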
Often, we are mainly concerned with the first form, since it pops up the most in practical applications (in most cases, data is read from and written to arrays or pointers, which leads to dependencies between loop iterations). Thus, the goal of data dependence analysis is to determine whether a loop iteration may depend on data written in a previous loop iteration, as automatic parallelization is usually prohibited in such cases.
Speculative execution
Speculative execution (SE) is an optimization technique where work is done before it is known whether it is actually needed, so as to avoid the delay that would be incurred by waiting until the need is certain. Applied to loop parallelization, such techniques run loops in parallel even in the presence of a possible dependency, since dependence detection happens afterwards. While there is a risk of failure (for which different SE approaches and systems specify different correction methods, from flushing the pipeline to rolling back to the initial sequential state), it serves its purpose in quite a few cases.
Thread level speculation
A common use of SE is to divide a sequential program into parallel threads in a way that does not violate the sequential semantics of the program. There must be an execution order among the threads for SE to guarantee correctness, so the safe (non-speculative) thread should precede all speculative threads. This setup is referred to as Thread Level Speculation (TLS), which parallelizes code with ambiguous dependencies by making assumptions about them and then using hardware to check those assumptions at runtime; i.e., there is usually some form of special hardware support that checks that no inter-thread dependence is violated. If there is a violation, the incorrect threads are squashed, the system rolls back to a previously stored safe state, and the threads are re-executed (after a number of retries, execution eventually falls back to sequential). If the speculation is correct, the computed results are used; if not, they are discarded.
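The dependence checking is done in hardware, but the speculate-check-rollback cycle can be sketched in software. The sketch below is my own illustration (not code from any real TLS system) for a loop of the form a[w[i]] += a[r[i]], whose dependences are unknown until the index arrays are seen at run time: iterations run optimistically in parallel on a copy of the data, a conservative conflict check follows, and on a violation the speculative results are discarded and the loop is re-executed sequentially.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical software sketch of the TLS idea; real systems detect
 * conflicts per access in hardware and squash only the offending threads. */
void speculative_update(double *a, int n, const int *r, const int *w, int m)
{
    double *shadow = malloc(n * sizeof *shadow);
    memcpy(shadow, a, n * sizeof *shadow);

    /* Speculation: optimistically run all iterations in parallel on a copy. */
    #pragma omp parallel for
    for (int i = 0; i < m; i++)
        shadow[w[i]] += shadow[r[i]];

    /* Check: two iterations conflict if one writes an element that the
     * other reads or writes. */
    int violation = 0;
    for (int i = 0; i < m && !violation; i++)
        for (int j = 0; j < m && !violation; j++)
            if (i != j && (w[i] == r[j] || w[i] == w[j]))
                violation = 1;

    if (violation) {
        /* Squash and roll back: discard the speculative copy and
         * re-execute the loop sequentially on the original data. */
        for (int i = 0; i < m; i++)
            a[w[i]] += a[r[i]];
    } else {
        /* Commit the speculative results. */
        memcpy(a, shadow, n * sizeof *shadow);
    }
    free(shadow);
}
```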
TLS tends to be a promising technique to easily build parallel software and effectively exploit all of the raw performance potential from a machine with parallel capabilities as it enables the compiler to optimistically create parallel threads, despite uncertainty as to whether those threads are actually independent. Steffan et al. proposed and evaluated a design for supporting TLS that seamlessly scales to any machine size because it is a straightforward extension of writeback invalidation-based cache coherence (which itself scales both up and down). Their experimental results demonstrate that the scheme performs well on both single-chip multiprocessors and on larger-scale machines where communication latencies are twenty times larger.
Polyhedral dependence analysis
A particular area of research under SE uses the aforementioned TLS setup with polyhedral dependence analysis, wherein more speculation candidate loops can be discovered than with normal OpenMP parallelization, thus extending the scope of automatic loop parallelization in general. Bhattacharyya and Amaral pursued this research and implemented a framework that can automatically perform speculative parallelization of loops using Polly's polyhedral dependence analysis (polyhedral optimization in LLVM). The framework uses two different heuristics to find speculation candidates: the first allows loops with only may dependencies to run speculatively in parallel, while the second filters out cold loops and, using profile information, loops with actual run-time dependencies as well.
Here, data dependencies (more specifically, inter-thread dependencies) are typically captured by monitoring the data written and the data read by individual threads. A data dependence violation occurs when a thread writes to a location that has already been read by a successor thread, and such dependence violations lead to the squashing of the thread (which involves discarding the side effects produced by the thread being squashed).
As for the algorithm in the framework (implemented in the LLVM infrastructure), two simple heuristics are used to find speculation candidates. The first relaxes the constraints of OpenMP parallelization by allowing loops with may dependences to be speculatively executed in parallel. The second heuristic extends the first: with the help of profiling, it filters out loops with run-time dependences as well as cold loops (which are not good candidates for speculation and may cause a slowdown). It was introduced to address a performance issue: with the first heuristic alone, the overhead of SE negated the gain from parallel execution and the algorithm underperformed OpenMP on small loops. To avoid such slowdowns, loops that do not take up a significant share of the program's execution time were not speculatively parallelized, since the thread creation overhead negates the gain of parallelizing them.
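For a concrete feel of what a 'may dependence' looks like (my own illustrative example, not one from the paper): static analysis usually cannot prove that the two pointers below never overlap, so it must report a possible cross-iteration dependence, even though most call sites pass disjoint buffers - exactly the kind of loop the first heuristic admits as a speculation candidate.

```c
/* Only a "may" dependence: dst and src might alias, so a purely static
 * dependence test has to be conservative, although the loop is independent
 * whenever the buffers do not overlap. */
void scale(double *dst, const double *src, double k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```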
Experimentation was performed on two different sets of benchmarks - the SPEC 2006 benchmarks (chosen because they have been used by other researchers for the evaluation of SE) and the PolyBench/C benchmarks (chosen because they are suitable for polyhedral analysis) - both on an IBM BlueGene/Q machine. For calculating speedups, each benchmark was run ten times for a given input and the average running time was computed. Bottom-line conclusion?
After profiling dependences and excluding cold loops and loops with a run-time dependence (yes, the second heuristic), the algorithm performed equal to or better than OpenMP in all cases. Overall, this technique adds application cases that outperform OpenMP, as the framework discovers more parallel loops than OpenMP does with static dependence analysis alone, by supplementing it with profiling and speculative parallelization.
Subscripted subscript patterns
A number of scientific applications contain loops in which an array is subscripted by another array. Automating the parallelization of such loops is not straightforward: when there are write references to the host array whose subscript expression contains the value of another array, current compile-time techniques are unable to prove independence across iterations of the loop and hence cannot detect it as parallelizable.
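A small illustrative loop of this kind (my own example, not taken from the paper) looks as follows; whether its iterations are independent hinges entirely on the values stored in the subscript array:

```c
/* Subscripted subscript pattern: the write target a[index[i]] depends on the
 * contents of another array. If index[] is known to be strictly increasing
 * (monotonic), no two iterations touch the same element of a[], so the loop
 * carries no dependence and can safely be parallelized. */
void subscripted_update(double *a, const int *index, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[index[i]] = b[i];
}
```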
With regard to this, Bhosale and Eigenmann proposed a compile-time algorithm that gathers the requisite information for automatically parallelizing loops containing subscripted subscript patterns. By analyzing the recurrence relations for a class of such patterns, it can detect monotonicity of the subscript array, one of the main properties that can, in turn, be used to prove the absence of data dependencies in loops involving subscript arrays. This is the first technique of its kind in that it works during compilation, as opposed to existing techniques that work only at runtime.
Yes, the idea of using subscript array properties for parallelizing loops with subscripted subscript patterns is not new. While there has been research that successfully analyzed certain properties (e.g., McKinley's work), the process of detecting the properties and parallelizing the involved loops has not yet been fully automated. Static analysis techniques for identifying loop-monotonic statements presented by Spezialetti et al. can detect monotonicity for scalar variables, but they are insufficient for detecting monotonicity of subscript arrays at compile time.
Runtime analysis techniques using either speculative execution (as discussed above) or an inspector/executor scheme can analyze the access pattern of the subscript array and execute the involved loop in parallel if no cross-iteration dependencies exist. Though powerful, a major drawback of these techniques is their runtime overhead: generating and executing the inspector and executor code (for inspector/executor schemes), or re-executing the loop sequentially if loop-carried dependencies are detected (for speculative runtime parallelization).
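A minimal inspector/executor sketch in the spirit of those schemes (hypothetical code, with the runtime cost of the inspector plainly visible):

```c
#include <stdlib.h>

/* Inspector: scan the subscript array at run time and check whether any
 * element would be written by two different iterations. */
static int has_duplicates(const int *index, int n, int range)
{
    char *seen = calloc(range, 1);
    int dup = 0;
    for (int i = 0; i < n && !dup; i++) {
        if (seen[index[i]]) dup = 1;
        seen[index[i]] = 1;
    }
    free(seen);
    return dup;
}

/* Executor: run the loop in parallel only if the inspector found no
 * cross-iteration conflict; otherwise fall back to sequential execution. */
void executor(double *a, const int *index, const double *b, int n, int range)
{
    if (!has_duplicates(index, n, range)) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[index[i]] = b[i];
    } else {
        for (int i = 0; i < n; i++)
            a[index[i]] = b[i];
    }
}
```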
By contrast, the technique here uses compile-time analysis, thus avoiding runtime overheads. While it does introduce some overhead during compilation, that cost is minimal and can be considered negligible, far outweighed by the benefit. The algorithm analyzes the loops (for loop nests, from the innermost loop outward), walks a form of program dependence graph (a control flow graph, to be more specific, as I mentioned in my slides), and aggregates the values assigned to the variables of interest; subsequent analysis then derives a property such as monotonicity, which helps the compiler recognize the application's loops as eligible for automatic parallelization. Analyzing the loops - traversing the graph node by node (statement by statement in the underlying code) and then detecting the presence of properties - involves no heavy computation, and in most cases can be done in a single pass.
Use of this compile-time technique shows substantial performance improvement in applications such as sparse matrix scaling and supernodal Cholesky factorization (with speedups of 739% and 383%, respectively).
An important observation from this research is that in many such application programs, the information necessary and sufficient to establish that the involved loops can in fact be parallelized is present in the program code itself, rather than being dependent on the input data as one might usually assume. Also, this technique works with the OpenMP API, which, as we know, uses directives appended to #pragma to construct parallel sections of code (using the fork-join model). These directives trigger OpenMP code generation at compile time (hinting the compiler to emit code capable of multi-threaded execution), and the proposed algorithm can add to the logic that provides such hints to the compiler.
Work related to the analysis methods used in this technique here include abstract interpretation and loop aggregation methods. For instance, Ammarguellat and Harrison presented a method for automatically recognizing recurrence relations in loops at compile-time. Their method uses abstract interpretation to summarize the net effect of the loop body upon each variable assigned in the loop, referred to as the abstract store. The presence of a recurrence relation is determined by matching the abstract store with one of nine pre-defined recurrence templates. Their techniques can recognize recurrence relations between scalars and one-dimensional arrays but are insufficient in recognizing complex recurrence expressions such as monotonic range assignments.
This algorithm also builds on a method applied by Tu and Padua in their array privatization technique to analyze array sections that are defined and used. This method contrasts with abstract interpretation and is able to precisely capture the effect of recurrence relationships, which allows us to gather such array properties as monotonicity. Understanding these effects is the key for successfully recognizing parallel loops.
More on similar work includes the hybrid analysis techniques proposed by Oancea and Rauchwerger, which make use of static inter-procedural and run-time techniques to analyze memory access patterns for proving monotonicity of subscript expressions. Their techniques can prove the independence of loops at compile-time by generating and analyzing predicates that represent sufficient conditions for parallelization. Predicates that are too complex to analyze at compilation time are evaluated during program execution. They could prove monotonicity for subscript arrays where the value of the subscript array can be expressed as a closed-form expression at compile time.
It turns out that with the presence of properties such as monotonicity (which can be derived from a recurrence relation between the subscript array elements) and the ability to detect them, the compiler can prove the absence of data dependencies and in turn parallelize the loops without the need for user assertions or pattern matching, which enhances the scope of our subject in discussion.
Compiler guided refactoring (feedback system)
Oftentimes, the compiler is blamed for not being able to devise a way to parallelize code. While this is true to a large extent, we shouldn't expect much given that there is no back-and-forth communication between the programmer and the compiler; the lack of a feedback mechanism is what we should blame instead. Luckily, there has been notable progress in this area as well, with the creation of a system in which the compiler provides feedback and suggestions on how the programmer can make his/her code amenable to auto-parallelization.
Larsen et al. proposed an interactive compilation feedback system that guides the programmer in iteratively modifying application source in order to improve the compiler’s ability to generate loop-parallel code.
By using the compilation system to modify two sequential benchmarks, they found that the code parallelized in this way runs up to 8.3 times faster. The benchmark kernels exposed several coding patterns which obstruct auto-parallelization in production compilers.
The proposed interactive compilation feedback system has two parts, the first of which is a library named 'libcodecomments' together with a set of patches to gcc's auto-parallelization subsystems. This extension of gcc generates code comments containing compiler feedback. In contrast to stand-alone tools, the code comments leverage the production-quality program analysis and optimization functionality already present in the compiler. The code comments are generated when one of the steps in the auto-parallelization optimization encounters an issue that prevents further analysis. The functionality in libcodecomments is then used to produce a human-understandable problem description. This is important because program analyses often fail while processing compiler-generated temporary variables that are meaningless to the programmer. Most importantly, libcodecomments is used to reconstruct source-level expressions (after preprocessing) and their file locations from compiler-generated temporaries. The generation of diagnostic dump files is controlled via the existing compiler flags, and the code comments are simply added to these dump files.
The second part of the system is a plug-in for the Eclipse C Development Tools (CDT). The code comments plugin enables CDT to parse the compiler feedback from dump files. The dump files are read by a custom step in the Eclipse build process, requiring no programmer intervention besides adding the appropriate compiler flags. The raw code comments are subsequently converted into markers, which are shown as icons in the left margin of the code in the Eclipse source editor. The markers automatically track the source code construct, such as a loop or variable, associated with the code comment. The comment may include a quick fix – i.e. a refactoring that automates the suggested transformation.
Similar to compiler warnings and errors, the code comments are automatically updated after each full or incremental build. Not all the code comments which can be generated by the modified compiler contain concrete advice on how to resolve a given issue. The types of feedback currently available to a non-compiler expert include aliasing comments, comments on function calls that prevent parallelization due to possible side-effects and comments on data-dependences among memory accesses. These comments are considered sufficient to address the most important compilation issues – those which are the least likely to be resolved in future releases of the gcc compiler.
Automatic parallelization involves numerous analysis steps, and every high-performance optimizing compiler must perform similar steps. The concrete implementations usually vary, leading to different strengths and weaknesses among compilers.
Here, gcc is used to exemplify why the compiler may be prevented from auto-parallelizing a loop nest. Temporary arrays and extra loop nests were introduced in the auto-parallelized version to work around limitations in gcc's data dependence analysis. Auto-parallelization also uses OpenMP's combined work-sharing construct, omp parallel for, whereas an expert performance programmer may enclose several loop nests with omp for directives in a single omp parallel region to reduce synchronization among threads. To measure the resulting performance if the aforementioned deficiencies were removed, the authors hand-parallelized the demosaicing code with OpenMP pragmas. Like the auto-parallelized version, the OpenMP version only exploits data parallelism, but it performs updates of the arrays in place instead of using temporary arrays. Furthermore, the number of parallel-region entries was minimized by using separate omp parallel and omp for directives in place of the combined omp parallel for directive; this reduced the number of times a parallel section was entered from twelve to five. Using the nowait clause on the omp for directives allowed the authors to remove three implicit barriers in total.
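The restructuring can be sketched roughly as below (a simplified, hypothetical pair of demosaicing-style kernels, not the benchmark code): the combined construct enters a parallel region per loop, whereas the hand-parallelized form enters one region and uses nowait where the second loop does not read the first loop's results.

```c
void passes_combined(float *red, float *blue, const float *raw, int n)
{
    /* Combined omp parallel for: each loop opens its own parallel region and
     * ends with an implicit barrier. */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        red[i] = 0.5f * (raw[i - 1] + raw[i + 1]);

    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        blue[i] = 0.25f * (raw[i - 1] + 2.0f * raw[i] + raw[i + 1]);
}

void passes_restructured(float *red, float *blue, const float *raw, int n)
{
    /* One parallel region entered once, with separate work-sharing loops.
     * The second loop reads only raw[], never red[], so the barrier after
     * the first loop can be dropped with nowait. */
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 1; i < n - 1; i++)
            red[i] = 0.5f * (raw[i - 1] + raw[i + 1]);

        #pragma omp for
        for (int i = 1; i < n - 1; i++)
            blue[i] = 0.25f * (raw[i - 1] + 2.0f * raw[i] + raw[i + 1]);
    }
}
```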
Currently, gcc contains two different frameworks to analyze data dependencies, of which the lambda framework is the older and more mature. It represents data dependencies as distance vectors and implements the classical data dependence tests. Much functionality, including the lambda framework itself, is shared between the loop parallelization and vectorization optimizations. Hence, code transformations that enable gcc to parallelize a loop using the lambda framework will also help in making it vectorizable.
The lambda dependence analysis framework requires that all possibly dependent loop iterations have the same distance vector. If that is not the case, automatic parallelization fails. This failure happens because the structure of data accesses that occur through subscripted variables (as discussed in the section above) is, or at least was at the time, too complex to be modeled, even if the iterations of the loop nest are in fact independent. Some loop nests in the demosaicing benchmark were not auto-parallelized for this reason.
After transforming the code to allow all preceding analysis steps to succeed, gcc was able to perform data dependence analysis on the loop nests. Although iterations of all loop nests in the benchmark are independent, eight of these loop nests update elements in place to reduce memory requirements. The in-place updates are possible when a loop nest writes only “odd” elements and reads only “even” elements or vice versa.
The authors could determine from the code comments for the application program that the data dependence analysis failed to discover this; for instance, the lambda framework cannot compute a distance vector that represents this dependence relation. A possible dependence between two such references must therefore be assumed in lieu of a more precise data dependence analysis. To avoid reading from and writing to the same array, a new temporary array has to be allocated to hold the writes. This effectively sidesteps the compiler's inability to analyze non-overlapping accesses to the same array. However, it also means that updates are no longer done in place, which decreases the spatial locality of the kernel and increases memory consumption by approximately 8%.
For each of the eight loop nests (in the application code) with in-place updates, a simple “copy” loop was added to write data from the new temporary array back to its original destination. These loops were fairly easy to add and were readily parallelized due to their simplicity.
However, an alternative solution exists: the programmer could have introduced additional restrict-qualified pointers until all potential data dependencies are ruled out. This solution affects neither the data access pattern nor the memory consumption, so performance would be unaffected. This shows that the programmer may need to choose among several alternative ways to refactor – each having a different performance impact. The authors pessimistically assume that the programmer chooses the type of refactoring that is the costliest in terms of performance.
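Both refactorings can be sketched on a simplified one-dimensional odd/even update (my own reduction of the pattern, not the benchmark code): the first routes the writes through a temporary array plus a copy loop, the second keeps the update in place and uses restrict-qualified pointers to assert that the written and read elements never overlap.

```c
#include <stdlib.h>

/* Workaround 1: temporary array plus a copy loop. The compiler no longer has
 * to reason about odd writes vs. even reads within the same array, but the
 * update is no longer in place (extra memory, worse spatial locality). */
void update_with_temp(float *img, int n)
{
    float *tmp = malloc(n * sizeof *tmp);

    #pragma omp parallel for
    for (int i = 1; i < n - 1; i += 2)          /* write odd, read even */
        tmp[i] = 0.5f * (img[i - 1] + img[i + 1]);

    #pragma omp parallel for                     /* the added "copy" loop */
    for (int i = 1; i < n - 1; i += 2)
        img[i] = tmp[i];

    free(tmp);
}

/* Workaround 2: restrict-qualified pointers. The programmer asserts that the
 * elements written through `odd` are never accessed through `even`, so the
 * update stays in place with no extra memory. */
void update_with_restrict(float *restrict odd, const float *restrict even, int n)
{
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i += 2)
        odd[i] = 0.5f * (even[i - 1] + even[i + 1]);
}
```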
Automatically scalable computation
Automatically Scalable Computation (ASC) achieves automatic parallelism by treating computation as a dynamical system. The memory and registers of a program form a (very) large state vector, and the instruction set architecture acts as a transition rule that moves the computation from one state to another. The parallelization strategy used here is to predict points that fall on the program execution trajectory (through the state space) and dispatch speculative executions from those predicted points to available cores. If the actual computation reaches a predicted state, it can then fast-forward to the point at which the speculative execution is completed. Crucially, these fast-forwards happen only if the predictions are correct, so there is no risk of bad execution. Thus, ASC can speed up program execution by using available cores to perform some computations early and caching the results until they are useful.
Kraft et al. presented an implementation of the ASC architecture that runs natively on x86 hardware and achieves near-linear speedup up to 44-cores (the size of their test platform) for several classes of programs, such as computational kernels, map-style programs, and matrix operations. They observe that programs are either completely predictable, achieving near-perfect predictive accuracy, or totally unpredictable, and therefore not amenable to scaling via ASC-like techniques.
Important findings from this research include that in most cases, the speedup is limited only by implementation details such as the overhead of the dependency tracking infrastructure and the manipulation of large state spaces. They were able to automatically parallelize programs with linked data structures that are not amenable to other forms of automatic parallelization.
A special case of automatic parallelization is automatic vectorization where multiple pairs of scalar operands are operated on at once. Such compilers use data dependence analysis to check the feasibility of such transformations and output a vectorized instruction if possible.
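A typical candidate looks like the kernel below (an illustrative example, not tied to any particular compiler): iterations are independent and the restrict qualifiers rule out overlap, so the compiler is free to process several elements per instruction.

```c
/* A loop amenable to auto-vectorization: no cross-iteration dependence, and
 * restrict guarantees that y and x refer to non-overlapping memory. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```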
The granularity of parallelization in vectorization is much finer than that aimed at by ASC and is thus complementary to ASC. Moreover, ASC benefits from vectorized instructions, exploiting them heavily in its internal sparse state vector representation.
The primary advantage of ASC over other automatic parallelization systems is its flexibility. Other systems mostly use some form of program analysis (static or dynamic) to analyze dependencies within loops and are limited by the power of those analysis techniques.
ASC, however, converts parallelization into a machine learning problem and is limited only by the ability of its learners to predict future states of a program, such as future iterations of a loop dependent on work done in past iterations. This means that programs that cannot be parallelized by a static compiler (the next section elaborates a bit on this) because of the data dependencies (loop dependent or independent) can be parallelized by ASC.
Static compiler parallelization
The most traditional method of parallelization is static compiler parallelization, which can take different forms. The most frequently used approach is to write the program in a parallel language or with a library like OpenMP, or with a parallel runtime like Cilk.
OpenMP uses pragmas to provide an API to the compiler to indicate opportunities for parallelization. Although not entirely automatic, it does the work of parallelization, with the programmer only needing to provide hints.
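A standard (if simplified) example of this hint-based model: the pragma asserts that iterations are independent, and the reduction clause tells the compiler and runtime how per-thread partial sums should be combined.

```c
/* OpenMP fork-join parallelism driven by a programmer hint: the compiler
 * creates the threads, distributes iterations, and combines the partial sums. */
double dot_product(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```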
Cilk on the other hand is a superset of C/C++ that introduces keywords for forking and joining threads and provides a scheduler to use them properly.
Again, this is not fully automatic, but makes it relatively easy for programmers to parallelize otherwise sequential code. Fully automatic static compilation requires that the compiler recognizes parallelizable loops, detects that the loop computations are independent, and then finally, that it introduces the proper thread structure to leverage the opportunity.
For example, the Intel compilers will automatically generate OpenMP pragmas for C++ and Fortran when they detect loops that can be safely and efficiently executed in parallel. Such static methods are limited to parallelizing loops where the compiler is certain of the absence of data dependencies.
Compilers used in high-performance computing employ sophisticated dependence analysis techniques that include identifying important programming patterns and transforming them into equivalent programs that are amenable to parallelization. However, these techniques only handle regular loops and quickly degenerate in the presence of less regular loops or non-standard patterns.
Traditionally, data dependence analysis has been done statically by compilers as well. Techniques such as the one proposed by Kong et al. (GCD and inequality tests) were used in the past to analyze data dependencies in array-based data accesses. However, such static analysis may not be effective in languages that allow pointers and dynamic allocation. For example, researchers observed that state-of-the-art automatic parallelizing compilers often failed to parallelize simple embarrassingly parallel loops written in C/C++, and that they had limited success with irregular data structures due to pointers and indirect memory accesses.
Rather than relying entirely on static analysis, dynamic analysis using data-dependence profiling is an alternative or complementary approach that addresses the limitations of the static approach, since all memory addresses are resolved at runtime. This brings us to the next and final section of research for this post/review.
Data dependence profiling
As multicore processors are now ubiquitous in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data dependence profiling is an important technique to exploit parallelism in programs. More specifically, manual or automatic parallelization can use the outcomes of data-dependence profiling to guide where to parallelize in a program.
However, state-of-the-art data dependence profiling techniques are not scalable as they suffer from two major issues when profiling large and long-running applications - the first being runtime overheads and the second being memory overheads. Existing data dependence profilers are either unable to profile large-scale applications or only report very limited information.
In response to this, Kim and Luk proposed a scalable approach to data dependence profiling that addresses both runtime and memory overheads in a single framework. Their technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, they compress memory accesses that exhibit stride patterns and compute data dependencies directly in a compressed format.
They demonstrate that SD3 reduces the runtime overhead of profiling SPEC 2006 by a factor of 4.1 and 9.7 on eight cores and 32 cores, respectively. For the memory overhead, they successfully profile SPEC 2006 with the reference input, while previous approaches fail even with the training input. In some cases, they observe more than a twenty-fold improvement in memory consumption and a 16-fold speedup in profiling time when 32 cores are used.
The proposed algorithm has two components, the first being a new data dependence profiling technique using a compressed data format to reduce the memory overhead by stride detection, and the second being the use of parallelization to accelerate the data-dependence profiling process (i.e., that data-dependence profiling itself can be effectively parallelized). Both of these are important contributions to the topic of data dependence profiling.
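To see what the profiling component fundamentally does, here is a toy sketch (my own illustration; SD3 itself adds stride compression and parallelizes this bookkeeping, among much else): record the last iteration that wrote each element and flag a loop-carried dependence whenever a later iteration touches that element.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy dependence profiler: last_writer[x] is the last iteration that wrote
 * element x, or -1 if it has not been written inside the loop yet. */
typedef struct { int *last_writer; int n; } profiler;

static profiler prof_init(int n)
{
    profiler p = { malloc(n * sizeof(int)), n };
    for (int i = 0; i < n; i++) p.last_writer[i] = -1;
    return p;
}

static void prof_read(profiler *p, int elem, int iter)
{
    if (p->last_writer[elem] >= 0 && p->last_writer[elem] < iter)
        printf("loop-carried RAW: iteration %d reads element %d written in iteration %d\n",
               iter, elem, p->last_writer[elem]);
}

static void prof_write(profiler *p, int elem, int iter)
{
    if (p->last_writer[elem] >= 0 && p->last_writer[elem] < iter)
        printf("loop-carried WAW: element %d written in iterations %d and %d\n",
               elem, p->last_writer[elem], iter);
    p->last_writer[elem] = iter;
}

int main(void)
{
    double a[8] = {0};
    profiler p = prof_init(8);
    for (int i = 1; i < 8; i++) {   /* instrumented loop: a[i] = a[i-1] + 1 */
        prof_read(&p, i - 1, i);
        a[i] = a[i - 1] + 1.0;
        prof_write(&p, i, i);
    }
    free(p.last_writer);
    return 0;
}
```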
Future work and conclusion
While the different areas of research discussed above have made significant progress in the area of automatic parallelization, there is still a lot of room for improvement, including extending the capabilities of the aforementioned frameworks and algorithms, for instance:
- Capturing the dependence behaviour of loops, and modifying the second heuristic in the framework proposed by Bhattacharyya and Amaral accordingly. Loops with runtime dependencies were filtered out by the second heuristic, but Berube et al. showed that runtime behaviour changes with the program's input; whether such changes in behaviour also affect the materialization of may dependences remains to be investigated. If, for a number of inputs, a loop has a high probability of being independent, it can be executed speculatively in parallel.
- The inclusion of support for analyzing subscript patterns with the monotonically decreasing variant, and support for matrices (two-dimensional arrays), in the compile-time algorithm presented by Bhosale and Eigenmann. In addition, applying the technique inside the Cetus parallelizer is another goal with ongoing work.
- Further state-space reduction and CPU efficiency improvements (the current way of performing data dependency analysis is suboptimal) in the ASC architecture proposed by Kraft et al.
- Dynamic detection of both stride and non-stride patterns (simultaneously) in loop nests, and advice provision from a scalable data dependence profiler for parallelizing legacy code, for Kim and Luk's SD3 technique.
All of these tend to be promising research directions with the potential to greatly enhance the scope of automatic parallelization.
Finally, it should be noted that automatic parallelization is not required for all programs, although making it generic and available as an option to the programmer is the goal. There is no single solution to parallelization, and it is often neither the only optimization nor the one we primarily need. Benchmarking and testing one's code frequently, to check that it performs acceptably (as close to the theoretical maximum throughput the CPU is capable of) and that it utilizes the desired resources (multiple CPU threads and RAM, for instance), are two things one should always do first, attempting to parallelize code only if necessary.
Anirban | 12/15/2021 |