Abstract

A solution to improve portability of automatically parallelized code, extend applicability of polyhedron model, organize dynamic load balancing between host system and accelerator is reached via on-the-fly transformations within JIT mechanisms of virtual execution system. An architecture of parallelizing optimizer module for ILDJIT is discussed. Transformed for heterogeneous execution LU-decomposition kernel illustrates significant speedup due to offload acceleration.

Motivation

Writing an efficient code to expose compute capabilities of parallel processor often leads to significant challenges. The task becomes more complex when targeting heterogeneous computing systems. The project aims to simplify programming of modern parallel architectures: multicore CPUs and various accelerators, such as AMD/Nvidia GPUs and Intel Xeon Phi.

Dynamic Parallelization of Computational Code as a Phase of Just-in-Time Compilation

An auto-parallelizing runtime

We propose a solution based on idea of virtual execution environment to address stated problems.

• All the parameters (technically - variables) become defined in runtime - the polyhedron model could be applied to wider set of programs, not only strict linear ones.
• Code portability is now the problem of execution environment, not the programmer.
• Platform-specific optimizations and heuristics could be applied within JIT mechanism in order to get best performance of parallelized code.

We chose ILDJIT - an open-source .NET virtual execution system and develop parallelizing extension for it. We have integrated state-of-art optimization and analysis approaches based on polyhedron model to the optimizer. A code to be parallelized needs to be parsed with polyhedron extractor firstly: for loops with induction variables, statements with (necessary) affine indexes are recognized and then transformed into polyhedral definition of statement domains. Space-time mapping leading to maximal parallelism is then computed. Resulting parallel programs are saved in ondisk storage called compute cache at the first time call to avoid unnecessary recompilation for already processed subprogram and its parameters in future.

LU-decomposition example

$$A = LU, \quad a_{i,j} = \begin{cases} 1 & \text{if } i = j, \text{ otherwise } 0 \end{cases}$$


$$\text{for } (j = 0; j < N; j++)$$


Underlined loops are parallel - their iterations could be mapped to different virtual and physical processors. k-loop is considered as time-loop.

We made benchmarks for square matrix of order $2^n$ of doubles on two different heterogeneous computing systems:

- Intel Xeon X5650 + Nvidia Tesla M2050 (adaptive load balancing between CPU cores and GPU)
- Pure Xeon Phi offload
- Pure Xeon Phi offload

Dynamic load balancing

A heuristic is to choose space-time mapping uniformly exposing fine-grained parallelism w.r.t. most of time steps (if possible), then determine representative benchmark tile size and estimate time parameters in order to compute a value for every time step. The code generation phase should be followed by instrumentation: runtime estimations and dynamic workload distribution should be injected into target program.

• A portion of virtual processors (say, $a$ from $N$) on every time step should be mapped to accelerator, if reasonable. We also should leave one CPU core for accelerator management. Rest $N-a$ virtual processors should be mapped to free CPU cores.
• Let $T_{ACC} = \frac{dep_{input}}{v_{ACC}} + a \cdot v_{ACC}$ be accelerator time, where $dep_{input}$ stands for input dependencies communication overhead and $v_{ACC}$ is average processing cost of computations corresponding to one virtual processor, including communication.
• Let $T_{CPU} = \frac{dep_{input} - prob_{rad}}{v_{CPU}} \cdot (N - a)$ be CPU time, where $prob_{rad}$ is the number of slices processed while estimating time parameters.
• All the parameters we suggest to estimate empirically. Total execution cost is $T_{MAX} = T_{ACC} + T_{CPU}$. It means that we look for the point $a$ for which $T_{ACC} + T_{CPU}$ holds. Usage of the accelerator is reasonable if and only if $a > \frac{N - prob_{rad}}{v_{CPU}} \cdot (N - 1) \cdot \frac{dep_{input}}{v_{CPU}} + (N - 1) \cdot \frac{prob_{rad}}{v_{CPU}}$. Let $a = \frac{1}{N - prob_{rad}} \cdot v_{CPU} \cdot (N - 1) \cdot \frac{dep_{input}}{v_{CPU}} + (N - 1) \cdot \frac{prob_{rad}}{v_{CPU}}$

Goals

We rely on strong mathematical background of polyhedron model and work for improving of common practices of using parallelizing compilers to:

- reach the portability of automatically parallelized code w. r. t. parallel architectures - the code written once must be runnable on several (supported) parallel architectures without full manual recompilation.
- shift the polyhedron model use to the extreme to deal with statically undefined model parameters.

HPC microarchitectures

AMD GPUs

Nvidia CUDA GPUs

Intel MIC

Multicore CPUs

For dynamic load balancing case we collected values of alpha during run time (adaptive variant) and tried to approximate the regression with quadratic polynomial (approximated variant):

$$4.5 \cdot 10^{-7} \cdot k^2 - 0.00189496 \cdot k + 46.22040315$$

Approximated variant performs faster because of eliminated estimations. Unfortunately, we had no success with Xeon PHI system – latencies of data transition were too large when compared to Nvida platform.

Results

• The solution allows to write sequential code once using high-level language (we consider C# for now) and then run it in parallel on different architectures without full manual recompilation.
• Significant speedup is achieved with polyhedral parallelization and adaptive load balancing between host and accelerator on every time step. One can benefit from offloading of computations onto modern accelerators; but should be careful with scenarios with intensive data transition over PCI-E bus. While Nvida platform allows dynamic load balancing, this is not an option because pure offload could perform better. Xeon PHI platform works well with pure offload and big data transitions hiding huge latencies.

Acknowledgements

Work was accomplished with hardware supplied by Intel Corporation to Rybinsk State Aviation Technical University.