Whole-Program Adaptive Error Detection and Mitigation

From HPCRL Wiki
Revision as of 22:04, 17 January 2017 by Baow (Talk | contribs)

Errors in application state caused by hardware faults are an increasing concern on extreme-scale computing systems. Errors that escape detection and lead to silent data corruption are particularly problematic. Detecting errors is an important first step toward fault-tolerant program execution. Unlike performance optimization, resilience strategies for scientific applications remain poorly understood. Existing approaches to handling errors are often “point studies”: techniques that address a particular class of errors (errors in memory, instruction execution, control flow, etc.) under specific assumptions about hardware vulnerability and for a narrow class of applications.

To improve application resilience strategies, there is a pressing need to investigate: (1) how errors affecting different portions of the execution state of a scientific application can be effectively detected, (2) how individual detectors and hardware can be characterized and composed in an automated fashion to design the most efficient full-application solution, (3) how detectors and their composition can be evaluated to provide the most comprehensive insights, and (4) which errors and fault rates must be tackled primarily in hardware for effective execution of scientific applications. We propose a comprehensive approach to error detection and mitigation for scientific applications (Topics 1 and 2 in the solicitation) that combines configurable error detectors, a unified reliability specification, and whole-program detector composition.

We will design and characterize configurable error detection techniques while accounting for hardware vulnerability characteristics, application resilience requirements, and the cost and capabilities of individual detector configurations. We will describe the fault behavior of scientific applications and target hardware in terms of a unified reliability specification. This specification will be used to compose individual detectors, factoring in the cost and coverage of each, and to develop an end-to-end error detection approach characterized by the best detector composition for the entire application with respect to the classes of errors being handled.
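As an illustration only, the idea of composing detectors by cost and coverage can be sketched as a budgeted greedy set cover. This is not the project's actual composition method. The `Detector` record, the `compose_detectors` function, the detector names, the overhead figures, and the budget are all hypothetical; only the error classes (memory, instruction execution, control flow) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detector:
    """Hypothetical description of one detector configuration."""
    name: str
    cost: int              # assumed runtime overhead, in percent (must be > 0)
    covers: frozenset      # error classes this configuration can detect


def compose_detectors(detectors, required, budget):
    """Greedy sketch: repeatedly pick the affordable detector that covers
    the most still-uncovered error classes per unit of cost.

    Returns (chosen detectors, error classes left uncovered, total cost).
    Greedy selection is a heuristic and is not guaranteed to be optimal.
    """
    chosen = []
    uncovered = set(required)
    spent = 0
    candidates = list(detectors)
    while uncovered:
        best, best_ratio = None, 0.0
        for d in candidates:
            gain = len(d.covers & uncovered)
            if gain == 0 or spent + d.cost > budget:
                continue  # adds nothing new, or does not fit the budget
            ratio = gain / d.cost
            if ratio > best_ratio:
                best, best_ratio = d, ratio
        if best is None:
            break  # no remaining detector helps within the budget
        chosen.append(best)
        candidates.remove(best)
        uncovered -= best.covers
        spent += best.cost
    return chosen, uncovered, spent


# Hypothetical detector configurations and overheads, for illustration only.
detectors = [
    Detector("checksum", 5, frozenset({"memory"})),
    Detector("duplication", 40, frozenset({"instruction", "control_flow"})),
    Detector("signature", 10, frozenset({"control_flow"})),
    Detector("invariant", 15, frozenset({"memory", "instruction"})),
]
required = {"memory", "instruction", "control_flow"}

chosen, uncovered, spent = compose_detectors(detectors, required, budget=30)
print([d.name for d in chosen], uncovered, spent)
```

With these made-up numbers, the expensive `duplication` detector never fits the 30% budget, so the sketch combines three cheaper detectors to cover all three error classes. A real composition would also have to weigh per-class fault rates and detection probabilities, which this sketch ignores.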

Publications

W. Bao, S. Krishnamoorthy, L.-N. Pouchet, F. Rastello, and P. Sadayappan. “PolyCheck: Dynamic verification of iteration space transformations on affine programs.” POPL 2016.

W. Bao, S. Krishnamoorthy, L.-N. Pouchet, F. Rastello, and P. Sadayappan. “PolyCheck: Dynamic verification of iteration space transformations on affine programs.” Technical Report, 2015.

Project Members
