Title: Lightweight Resiliency Mechanism via Compiler Techniques
Chao Chen
Ph.D. Student in Computer Science
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Monday, November 4, 2019
Time: 10:30 - 12:00 (EST)
Location: KACB 3126
Committee:
------------
Dr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)
Dr. Greg Eisenhauer (Advisor, School of Computer Science, Georgia Institute of Technology)
Dr. Ling Liu (School of Computer Science, Georgia Institute of Technology)
Dr. Vivek Sarkar (School of Computer Science, Georgia Institute of Technology)
Abstract:
-----------
Transient faults are a significant concern for emerging extreme-scale high performance computing (HPC) systems.
This nascent problem is exacerbated by technology trends toward smaller transistor size, higher circuit density and
the use of near-threshold voltage techniques to save power. While transient faults in memories can be managed with
parity techniques, faults in processing components are not so easily detectable and manageable. These faults can
cause major problems for HPC applications. Faults in different CPU components manifest differently and are best
approached in different ways. Faults manifested in floating point units are highly likely to corrupt applications’ state
without any warning and lead to incorrect outputs (called Silent Data Corruptions, or SDCs), while faults in integer
computations are more likely to cause control problems and/or manifest as addressing faults that terminate the
application (named Soft Failures, or SFs), because integer instructions tend to dominate control and address
calculations in HPC applications. While SDCs undermine confidence in computations and can lead to inaccurate scientific
insights, SFs degrade system efficiency and performance: impacted jobs must be restarted from
their checkpoints and must recompute lost work before continuing normal operation. To address these
challenges, this thesis proposes a set of lightweight techniques to mitigate the impact of transient faults by both
exploiting application properties for SDC detection, and by leveraging compiler techniques for recovery. This work
makes the following contributions:
First, this thesis proposes LADR, a low-cost application-level SDC detector for scientific applications. LADR protects
scientific applications from SDCs by watching for data anomalies in their state variables. It employs compile-time
data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads
while maintaining high fault coverage with a low false-positive rate.
Second, this thesis proposes CARE, a lightweight compiler-assisted technique for on-the-fly repair of processes crashed
by transient faults in the address path. The goal of CARE is to let repaired processes simply continue executing
instead of being terminated and restarted. During the compilation of applications, CARE constructs a recovery kernel for each
load/store. It traps segmentation faults caused by the use of corrupted addresses, extracts appropriate state from the suspended
process and uses the recovery kernels to attempt to recreate a correct version of the address, so that it can retry the faulted
load/store and continue the application. By leveraging compile-time preparation and using segmentation faults as
its detection mechanism, CARE incurs no runtime overhead during fault-free execution and spends
minimal time in recovery when a fault does occur.
Finally, despite the promising results achieved by CARE, recovery remains challenging for important runtime
artifacts such as induction-variable updates, which cause a significant portion of failures in many other scientific workloads.
To address this challenge, we examined the code-optimization techniques in modern compilers and found that some of them,
such as strength reduction, open up recovery opportunities by turning array accesses into strength-reduced pointers
that are updated independently in lockstep. A modified induction-variable-based strength reduction yields independent but
equivalent computations (patterns), so that a correct value for a corrupted pointer can be inferred from the value of another.
Smarter recovery kernels are thus designed to recover from a broader range of soft failures by exploiting the "accidental"
redundancy introduced by code-optimization techniques, with no impact on code speed.