*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
Title: Advancing Compiler Optimizations for Parallel Architectures
Prasanth Chatarasi
Ph.D. Student in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Monday, December 9, 2019
Time: 11:00 - 1:00pm (EST)
Location: KACB 3126
Committee:
------------
Dr. Vivek Sarkar (Advisor), School of Computer Science, Georgia Institute of Technology
Dr. Jun Shirako (Co-Advisor), School of Computer Science, Georgia Institute of Technology
Dr. Tushar Krishna, School of Electrical and Computer Engineering, Georgia Institute of Technology
Dr. Santosh Pande, School of Computer Science, Georgia Institute of Technology
Abstract:
-----------
Many parallel architectures are emerging to improve computing capabilities as we approach the end of Moore’s law (doubling transistor count per every two years). Contemporaneously, the demand for higher performance is broadening across multiple application domains. The above trends pose a plethora of challenges to application development using high-performance libraries, which is the de-facto approach to achieving higher performance, and some of the challenges are porting/adapting to multiple emerging parallel architectures, supporting rapidly advancing domains, and also inhibiting optimizations across library calls. Hence, there is a renewed focus on optimizing compilers from industry and academia to address the above trends, but it requires advancements in enabling them to a wide range of applications and also to better exploit current and future parallel architectures, which is the focus of my thesis.
Firstly, most of the compiler frameworks perform conservative dependence analysis in the presence of unanalyzable program constructs such as pointer aliasing, unknown function calls, non-affine expressions, recursion, and unstructured control flow, which can limit the applicability of transformations even though they may be legal to apply. Our work is motivated by the observation that software with explicit parallelism for multi-core CPUs, GPUs is on the rise. Our approach (PoPP) uses explicit parallelism specified by the programmer as logical parallelism, and refines the conservative dependence analysis with partial execution order from the explicit parallelism to enable a larger set of loop transformations for enhanced parallelization, compared to what might have been possible if the input program is sequential.
Secondly, despite the fact that compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap can be found in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. A significant limitation in past work is the lack of a unified formulation that synergistically integrates multiple storage transformations to break the cycles and further unify the formulation with loop transformations to enable vectorization. To address this limitation, we propose an approach (PolySIMD).
Thirdly, finding optimal compiler mappings for an application onto an accelerator can be extremely challenging because of a massive space of possible data-layouts and loop transformations. For example, there are over 10$^{19}$ valid mappings for a single convolution layer on average for mapping ResNet50 and MobileNetV2 on a representative DNN edge accelerator. To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. The motivation for this decomposition is to reduce the size of the search space dramatically, and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more compared to the on-chip data movement. We introduce {\em Marvel}, which implements the above approach by leveraging two cost models to explore the two subspaces -- a classical distinct-block (DB) locality cost model for the off-chip subspace, and a state-of-the-art DNN accelerator behavioral cost model, MAESTRO, for the on-chip subspace. Our approach also considers dimension permutation, a form of data-layouts, in the mapping space formulation along with the loop transformations.
h space problem.
Finally, with the emergence of a near-memory thread migratory architecture (EMU) to address the locality wall from weak-locality applications, as part of the proposed work, we plan to develop locality and thread-migration aware compiler optimizations to enhance the performance of graph analytics on the EMU machine. Our preliminary evaluation of compiler optimizations such as node fusion and edge flipping gives a significant benefit over the original programs written without being aware of thread migrations.