PhD Proposal by Chao Chen

Event Details

Date/Time:
- Monday November 4, 2019
  10:30 am - 11:59 am
Location: KACB 3126
Fee(s):
N/A

Summaries

Summary Sentence: Lightweight Resiliency Mechanism via Compiler Techniques

Title: Lightweight Resiliency Mechanism via Compiler Techniques

Chao Chen

Ph.D. Student in Computer Science

School of Computer Science

College of Computing

Georgia Institute of Technology

Date: Monday, November 4, 2019

Time: 10:30 - 12:00 (EST)

Location: KACB 3126

Committee:

------------

Dr. Santosh Pande (Advisor, School of Computer Science, Georgia Institute of Technology)

Dr. Greg Eisenhauer (Advisor, School of Computer Science, Georgia Institute of Technology)

Dr. Ling Liu (School of Computer Science, Georgia Institute of Technology)

Dr. Vivek Sarkar (School of Computer Science, Georgia Institute of Technology)

Abstract:

-----------

Transient faults are a significant concern for emerging extreme-scale high performance computing (HPC) systems. This nascent problem is exacerbated by technology trends toward smaller transistor size, higher circuit density, and the use of near-threshold voltage techniques to save power. While transient faults in memories can be managed with parity techniques, faults in processing components are not so easily detected or managed. These faults can cause major problems for HPC applications. Faults in different CPU components manifest differently and are best approached in different ways. Faults in floating-point units are highly likely to corrupt an application's state without any warning and lead to incorrect outputs (called Silent Data Corruptions, or SDCs). Faults in integer computations are more likely to cause control problems or to manifest as addressing faults that terminate the application (called Soft Failures, or SFs), because integer instructions tend to dominate control and address calculations in HPC applications. While SDCs undermine confidence in computations and could lead to inaccurate scientific insights, SFs degrade system efficiency and performance: impacted jobs must be restarted from their checkpoints and must recompute lost work before resuming normal operation. To address these challenges, this thesis proposes a set of lightweight techniques that mitigate the impact of transient faults by exploiting application properties for SDC detection and by leveraging compiler techniques for recovery. This work makes the following contributions:

First, this thesis proposes LADR, a low-cost application-level SDC detector for scientific applications. LADR protects scientific applications from SDCs by watching for data anomalies in their state variables. It employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads while maintaining a high level of fault coverage with low false positive rates.
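
To make the idea of watching a state variable for data anomalies concrete, here is a minimal Python sketch. It is not LADR itself: the abstract does not say which anomaly test LADR applies, so the linear-extrapolation rule, the `rel_tol` threshold, and the names `first_anomaly` and `flip_bit` are all assumptions chosen for illustration. The compile-time selection of which variables to monitor is not modeled.

```python
import struct


def flip_bit(value, bit):
    """Return `value` with one bit of its IEEE-754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def first_anomaly(history, rel_tol=0.5, floor=1e-12):
    """Index of the first sample that breaks the smoothness check, or None.

    Assumed rule (not taken from the abstract): predict each sample by linear
    extrapolation from the two before it and flag a large relative deviation.
    """
    for i in range(2, len(history)):
        predicted = 2 * history[i - 1] - history[i - 2]
        scale = max(abs(predicted), abs(history[i - 1]), floor)
        deviation = abs(history[i] - predicted)
        # `not (a <= b)` also fires when the corrupted value is NaN or inf.
        if not deviation <= rel_tol * scale:
            return i
    return None


# A smoothly evolving state variable sampled once per time step.
clean = [1.0 + 0.1 * step for step in range(10)]
corrupted = list(clean)
corrupted[6] = flip_bit(corrupted[6], 61)  # flip one exponent bit

print(first_anomaly(clean))      # None
print(first_anomaly(corrupted))  # 6
```

A check of this kind only catches corruptions large enough to break the expected trend. A flip in a low-order mantissa bit passes unnoticed, which is one reason a real detector has to balance fault coverage against false positives.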

Second, this thesis proposes CARE, a lightweight compiler-assisted technique for on-the-fly repair of processes crashed by transient faults in the address path. The goal of CARE is to let repaired processes simply continue executing instead of being terminated and restarted. During the compilation of applications, CARE constructs a recovery kernel for each load/store. It traps segmentation faults caused by the use of corrupted addresses, extracts the appropriate state from the suspended process, and uses the recovery kernels to attempt to recreate a correct version of the address, so that it can retry the faulted load/store and continue the application. Because CARE prepares at compile time and uses segmentation faults as its detection mechanism, it adds no runtime overhead to fault-free execution and spends minimal time in recovery when a fault occurs.
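
The trap, recompute, and retry flow described above can be imitated in a few lines of Python. This is only a simulation under stated assumptions: CARE operates on compiled code and real segmentation faults, whereas here a hypothetical `SimulatedSegfault` exception stands in for the fault, a list stands in for memory, and `row_major_index` plays the role of a recovery kernel for one load. None of these names come from the abstract.

```python
class SimulatedSegfault(Exception):
    """Stands in for the segmentation fault raised by a bad address."""


def raw_load(memory, index):
    """Load memory[index], faulting on any out-of-range index."""
    if not 0 <= index < len(memory):
        raise SimulatedSegfault(index)
    return memory[index]


def row_major_index(row, col, width):
    """Recovery kernel: recompute the index from values still live at the load."""
    return row * width + col


def guarded_load(memory, index, kernel, kernel_args):
    """Try the load; on a fault, rebuild the index with the kernel and retry."""
    try:
        return raw_load(memory, index)      # fault-free path: kernel never runs
    except SimulatedSegfault:
        repaired = kernel(*kernel_args)     # recreate the intended index
        return raw_load(memory, repaired)   # retry the faulted load


memory = list(range(12))                    # a 3 x 4 matrix stored row-major
row, col, width = 2, 1, 4
good_index = row_major_index(row, col, width)
bad_index = good_index ^ (1 << 40)          # injected bit flip in the index

print(guarded_load(memory, good_index, row_major_index, (row, col, width)))  # 9
print(guarded_load(memory, bad_index, row_major_index, (row, col, width)))   # 9
```

The sketch also shows a limit of using the fault itself as the detector: a corrupted index that still lands inside valid memory raises nothing, so no repair is attempted.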

Finally, despite the promising results achieved by CARE, recovery remains very challenging for important runtime artifacts such as induction variable updates, which cause a significant portion of failures in many other scientific workloads. To address this challenge, we examined the code optimization techniques in modern compilers and found that some of them, such as strength reduction, can open up opportunities by turning array accesses into strength-reduced pointers that are updated independently in lockstep. Modified induction-variable-based strength reduction allows independent but equivalent computations (patterns), so that a correct value for a corrupted pointer can be inferred from the value of another. Thus, smarter recovery kernels are designed to recover from a broader range of soft failures by exploiting the “accidental” redundancy introduced by code optimization techniques, with no impact on code speed.
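
The lockstep redundancy can be illustrated with a small Python simulation. It assumes a loop over two arrays whose accesses have been strength-reduced into two cursors, `pa` and `pb`, that advance by the same stride, so that `pa - base_a == pb - base_b` holds on every iteration. The function name `dot_with_recovery`, the `fault_at` injection hook, and the `SimulatedSegfault` exception are hypothetical and used only for this sketch; the actual recovery kernels work on compiled code.

```python
class SimulatedSegfault(Exception):
    """Stands in for the segmentation fault raised by a bad address."""


def load(memory, address):
    """Load memory[address], faulting on any out-of-range address."""
    if not 0 <= address < len(memory):
        raise SimulatedSegfault(address)
    return memory[address]


def dot_with_recovery(memory, base_a, base_b, n, fault_at=None):
    """Dot product of two arrays walked by lockstep cursors pa and pb."""
    pa, pb = base_a, base_b
    total = 0
    for step in range(n):
        if step == fault_at:
            pa ^= 1 << 40                    # injected bit flip (demo only)
        try:
            x = load(memory, pa)
        except SimulatedSegfault:
            # pa's only source is its own previous value, so it cannot be
            # recomputed from inputs. Infer it from the sibling cursor.
            pa = base_a + (pb - base_b)
            x = load(memory, pa)
        y = load(memory, pb)
        total += x * y
        pa += 1                              # both cursors advance in lockstep
        pb += 1
    return total


memory = [1, 2, 3, 4] + [10, 20, 30, 40]     # array a at 0, array b at 4

print(dot_with_recovery(memory, 0, 4, 4))              # 300
print(dot_with_recovery(memory, 0, 4, 4, fault_at=2))  # 300
```

Because the repaired cursor is written back, the remaining iterations proceed normally, and the redundant cursor already exists in the optimized loop, so the fault-free path does no additional work.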

Additional Information

In Campus Calendar

Groups

Graduate Studies

Invited Audience

Faculty/Staff, Public, Graduate students, Undergraduate students

Categories

Other/Miscellaneous

Keywords

Phd proposal

Status

Created By: Tatianna Richardson
Workflow Status: Published
Created On: Oct 28, 2019 - 9:19am
Last Updated: Oct 28, 2019 - 9:19am
