Opportunistic Memory Systems in Presence of Hardware Variability

Speaker: Mark Gottscho
Affiliation: Ph.D. Candidate - UCLA

Abstract: The memory system presents many problems in computer architecture and system design. A fundamental issue is worsening hardware variability and environmental sensitivity due to manufacturing difficulties at nanometer technology nodes. As a consequence, memories often limit the resiliency and energy efficiency of computing platforms from embedded systems to cloud datacenters and supercomputers. To help address these challenges, in this dissertation, I propose the design of systems that: (Part 1) opportunistically exploit memory variability; and (Part 2) opportunistically cope with memory errors to improve energy efficiency and resiliency.

In this talk, I will first provide a brief overview of six projects that comprise the dissertation, and then I will focus on the novel concept of Software-Defined Error-Correcting Codes (SDECC). SDECC is a new approach to fault-tolerance that tightly integrates ECC hardware with system software for improved system reliability and availability. SDECC makes it possible to heuristically recover from detected-but-uncorrectable errors (DUEs) in memory without adding any overheads in the normal/common cases when DUEs do not occur. The key insight is that for any given DUE, there are a small number of “candidate codewords” that could correspond to the original uncorrupted message data. A software-based recovery policy leverages side information about general patterns of application data – such as local data entropy in a cacheline, or the relative frequencies of program instructions – to choose the most likely candidate codeword (when confidence is high) or otherwise force a machine panic to avoid data corruption (when confidence is low). Using typical codes such as SECDED and ChipKill, SDECC can prevent the vast majority of system crashes and/or checkpoint rollbacks that would otherwise result from memory DUEs. The technique could improve the availability and energy-efficiency of supercomputers and the reliability of approximation-tolerant applications without burdening systems with extra hardware overheads and design complexity.
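
To make the "candidate codewords" idea concrete, here is a minimal Python sketch. It is not the SDECC implementation; it assumes a toy (8,4) extended Hamming SECDED code instead of the wider codes used in real memories, and the names (encode, candidate_messages, recover) and the neighbor-similarity score are hypothetical stand-ins for the side information described above. With this code, a double-bit error is detected but not correctable, and exactly four codewords lie at Hamming distance 2 from the received word.

```python
from itertools import product

def encode(d):
    """Encode 4 data bits with a toy (8,4) extended Hamming (SECDED) code."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]
    p0 = 0
    for bit in word:          # overall parity bit extends SEC to SECDED
        p0 ^= bit
    return tuple(word + [p0])

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

# Brute-force codebook: codeword -> message (only 16 entries for this toy code).
CODEBOOK = {encode(d): d for d in product((0, 1), repeat=4)}

def candidate_messages(received):
    """Messages whose codewords are exactly 2 bit flips away from the DUE."""
    return [msg for cw, msg in CODEBOOK.items()
            if hamming_distance(cw, received) == 2]

def recover(received, score, margin):
    """Pick the best-scoring candidate, or return None (i.e. panic)
    when the top two candidates are too close to call."""
    ranked = sorted(candidate_messages(received), key=score, reverse=True)
    if len(ranked) > 1 and score(ranked[0]) - score(ranked[1]) < margin:
        return None
    return ranked[0]

# Hypothetical side information: similarity to a neighboring word
# assumed to sit in the same cacheline.
neighbor = (1, 0, 1, 0)

def similarity(msg):
    return sum(a == b for a, b in zip(msg, neighbor))

original = (1, 0, 1, 1)
received = list(encode(original))
received[0] ^= 1              # inject a double-bit error (a DUE)
received[4] ^= 1
received = tuple(received)

candidates = candidate_messages(received)
recovered = recover(received, similarity, margin=1)
```

In this run the original message is among the four candidates and the toy policy selects it; raising margin to 2 makes the policy decline to guess and return None, which corresponds to forcing a panic when confidence is low.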

Biography: Mark Gottscho is a Ph.D. Candidate in the Electrical Engineering Department at UCLA and is advised by Prof. Puneet Gupta. His research focuses on the cross-layer relationships between hardware variability/reliability and memory architectures/systems. He received his BS in 2011 and his MS in 2014 from the Department. In 2014, his MS work earned an Honorable Mention from the NSF Graduate Research Fellowship Program as well as the Department’s Outstanding Master’s Research award. In 2016, he won both the UCLA Dissertation Year Fellowship and the Qualcomm Innovation Fellowship (together with Clayton Schoeny) for his Ph.D. work on Software-Defined Error-Correcting Codes. Mark will join the datacenter hardware platforms group at Google upon successful completion of his Ph.D. He is a student member of both the IEEE and the ACM.

For more information, contact Prof. Puneet Gupta (puneet@ee.ucla.edu).

Date/Time:
Date(s) - May 12, 2017
12:30 pm - 2:30 pm

Location:
E-IV Tesla Room #53-125
420 Westwood Plaza - 5th Flr., Los Angeles CA 90095