Preserving Application Reliability on Unreliable Hardware
Apr 11, 2013
from 01:30 PM to 03:00 PM
Where: Engr. IV Bldg., Shannon Room 54-134
Contact Name: Prof. Puneet Gupta
University of Illinois at Urbana-Champaign
As technology scales, increasingly small devices become susceptible to in-field hardware failures caused by high-energy particle strikes (soft errors). Future systems therefore need low cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware errors using low cost software-level symptom monitors. These monitors detect only those errors that affect software execution. However, there remains a non-negligible risk that some faults may escape these detectors and silently corrupt application outputs, producing silent data corruptions (SDCs). Identifying SDC-vulnerable program locations and devising mechanisms to detect errors in them is therefore crucial for developing low cost reliability solutions. Such a solution would allow selective, low cost, application-specific detectors that can be tuned to trade off reliability against performance.
In this talk, I present Relyzer, an approach that systematically analyzes all application fault sites to identify virtually all SDC-vulnerable program locations. Instead of performing time-consuming fault injections on all possible application-level fault sites, which is impractical, Relyzer carefully picks a small subset. It employs novel fault pruning techniques that reduce the number of fault sites by either predicting their outcomes or showing them to be equivalent to others. Results show that 99.78% of faults are pruned across twelve studied workloads, reducing the complete application reliability evaluation time by 2-6 orders of magnitude.
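The equivalence-based side of the pruning idea can be illustrated with a minimal sketch. It is not Relyzer's actual heuristics: the fault-site tuple layout, the `prune_fault_sites` and `pruned_fraction` helpers, and the equivalence key (static instruction address plus a path signature) are all hypothetical, chosen only to show how grouping sites into classes and injecting into one representative per class shrinks the number of injections.

```python
from collections import defaultdict


def prune_fault_sites(sites, key):
    """Group fault sites into equivalence classes and keep one pilot per class.

    `key` is an assumed equivalence function; sites with the same key are
    treated as having the same fault outcome.
    """
    classes = defaultdict(list)
    for site in sites:
        classes[key(site)].append(site)
    # The first site seen in each class serves as its representative (pilot).
    pilots = {k: members[0] for k, members in classes.items()}
    return pilots, classes


def pruned_fraction(num_sites, num_pilots):
    """Fraction of fault sites that no longer need their own injection."""
    return 1.0 - num_pilots / num_sites


# Toy trace of hypothetical sites: (static_pc, dynamic_instance, path_signature)
sites = [(0x40, i, "A" if i % 2 == 0 else "B") for i in range(1000)]
sites += [(0x44, i, "A") for i in range(1000)]

# Assumed equivalence: same static instruction and same path signature.
pilots, classes = prune_fault_sites(sites, key=lambda s: (s[0], s[2]))

print(len(sites), "fault sites,", len(pilots), "injections needed")
print("pruned fraction:", pruned_fraction(len(sites), len(pilots)))
```

In this toy trace, each pilot's injection outcome would stand in for every site in its class, so the accuracy of such a scheme depends entirely on how well the chosen key predicts equivalent outcomes.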
The ability to list virtually all SDC-vulnerable program locations can transform the area of designing application-centric reliability solutions. As a first step, I employed Relyzer, identified SDC-hot program locations, and developed low cost program-level error monitors to convert SDCs to detections. This application-centric solution, for the first time, allows software and hardware architects, who largely over-provision systems for higher reliability (by trading off performance or power), to tune reliability by protecting just the desired set of vulnerable program locations. Relyzer also opens new avenues of research in designing error-resilient programming models as well as even faster (and simpler) evaluation methodologies.
Siva Hari is a Ph.D. candidate in the Department of Computer Science at the University of Illinois at Urbana-Champaign. His research interests lie in the areas of computer architecture, reliability-aware systems, and energy-efficient computing. He received the W.J. Poppelbaum Memorial Award from the Computer Science Department at the University of Illinois at Urbana-Champaign in 2012 for academic merit and creativity in computer architecture. He also won the Margarida Jacome Best Poster Award at the GSRC Annual Symposium in 2012, where 56 projects from 15 universities were showcased. His Relyzer paper, originally published in ASPLOS 2012, was recently selected for IEEE Micro’s Top Picks 2013 issue; only 11 papers were chosen from the top computer architecture papers published in 2012 to be included in this issue. Siva holds an M.S. from the University of Illinois and a B.Tech. from the Indian Institute of Technology (IIT) Madras.