Title: Compiler Guided Scheduling: A Cross-stack Approach for Performance Elicitation
Girish Mururu
Ph.D. Student in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Tuesday, November 12, 2019
Time: 1:30 - 3:30 pm (EST)
Location: KACB 2100
Committee:
-------------------
Dr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)
Dr. Ada Gavrilovska (School of Computer Science, Georgia Institute of Technology)
Dr. Kishore Ramachandran (School of Computer Science, Georgia Institute of Technology)
Dr. Vivek Sarkar (School of Computer Science, Georgia Institute of Technology)
Abstract:
-------------
Modern software executes on multi-core systems that share resources such as
several levels of the memory hierarchy (caches, main memory, and persistent storage)
and I/O devices. In such a co-execution environment, the performance of modern
software is critically affected by resource conflicts arising from this
sharing. Compiler optimizations have traditionally focused on analyzing
and optimizing the performance of individual (single or multi-threaded) applications
and have resulted in tremendous strides in this regard. On the other hand, schedulers
have dealt with the problem of resource sharing mostly adopting fairness as the primary
criterion in terms of single core sharing while adopting load balancing and processor affinity
as the criteria in terms of multi-core scheduling. Both the compiler and the scheduler stacks have
continued to evolve stand-alone; this thesis claims that there is a significant opportunity
to improve performance (execution speed of individual applications), resource sharing and throughput
by developing a synergistic approach that combines compiler generated dynamic application attributes
with runtime aggregation of such information to undertake smart scheduling decisions.
One of the key goals of this work is to perform such a cross-stack approach by utilizing the
existing systems interfaces without introducing new ones. Another important characteristic is that the
approach is built with a layering solution without modifying the OS. Such a design is envisioned not to
perturb other systems properties which have evolved over a period of time.
Modern workloads related to computer vision, media computation and machine learning exhibit a very
high degree of data locality. Although modern OSes deploy processor affinity to induce
data-locality-aware scheduling, their lack of knowledge of precise dynamic application
characteristics leaves considerable performance on the table because of the large number
of process migrations carried out by the scheduler. In our first work, PinIt, we decrease unwanted migrations of processes among
cores by only influencing the scheduler without modifying it. In order to fairly and efficiently
utilize cores, schedulers such as CFS migrate threads between cores during execution. Although such
thread migrations alleviate the problem of stalling and yield better core utilization, they can
also destroy data locality, resulting in fewer cache and TLB hits and thus a collective
performance loss across the group of co-executing applications.
PinIt first determines the regions of a program in which the process should be pinned onto a core,
so that adverse migrations causing excessive cache and TLB misses are avoided, by calculating memory
reuse density, a new measure that quantifies the reuses within code regions. Pin/unpin calls are
then hoisted to the entries and exits of such regions, and process migration is prevented within
the pinned regions. The thesis presents new analyses and transformations that optimize the placement
of such calls. In an overloaded environment, compared to priority-CFS, PinIt speeds up high-priority
applications in mediabench workloads by 1.16x and 2.12x, and in computer-vision workloads by
1.35x and 1.23x, on 8 cores and 16 cores, respectively, with almost the same or better throughput for
low-priority applications.
In the second work, to achieve very high throughput for a large number of batch jobs, we develop a
throughput-oriented scheduler that processes compiler-inserted beacons transmitting applications' dynamic
information to the scheduler at runtime. Typically, schedulers conservatively co-locate processes to avoid
cache conflicts, since miss penalties are quite heavy, leading to lower resource utilization
(ranging from 50% to 70%). Moreover, in a throughput-oriented setting, such conservative scheduling
leads to significant losses in terms of achieved throughput. Our approach relies on the compiler to
insert ``beacons'' in the application at strategic program points to periodically produce and/or
update details of anticipated resource-heavy program region(s). The compiler classifies loops in
programs based on cache usage and predictability of their execution time and inserts different types
of beacons at their entry/exit points. The precision of the information carried by beacons varies as
per the analyzability of the loops, and the scheduler uses performance counters at runtime to
fine-tune decision-making for concurrency. The information produced by beacons in multiple processes is aggregated
and analyzed by the predictive scheduler to proactively respond to the anticipated workload requirements.
A framework prototype demonstrates high-quality predictions and improvements in throughput over CFS by
up to 4.7x on ThunderX and up to 5.2x on ThunderX2 servers for consolidated workloads.
Finally, as part of the proposed work, we devise extensions to the beacon scheduler to target
low-latency environments and to efficiently handle multi-threaded processes. To achieve low latency
while using resources efficiently, the scheduler must divide the available processes among the available
cores such that latency is lowest for each process. In the case of multi-threaded processes, the compiler
and the runtime must send beacons for each thread, and the scheduler must manage resources efficiently
across the threads of all processes. We also plan to augment the beacon analysis with path and
call-chain prediction so as to predict the entire workload of an application at launch and then
send corrective beacons upon misprediction, yielding a fully predictive scheduler.