Lightweight Opportunistic Memory Resilience

Speaker: Irina Alam
Affiliation: Ph.D. Candidate

Also Via Zoom: https://ucla.zoom.us/j/95227278944?pwd=KzN4cWhiWk9mNTdGTXVkYVg1WXFZdz09

Abstract: The reliability of memory subsystems is fast becoming a major bottleneck in computing systems. Memories are primarily designed to maximize bit storage density, making them particularly sensitive to manufacturing process variation, environmental operating conditions, and aging-induced wearout. The challenge with memory fault-tolerance techniques is that they need to be effective while incurring minimal overhead. Hence, we focus on developing a complementary suite of novel methods for tolerating faults and correcting errors at different levels of the memory hierarchy while taking software behavior into account. The proposed solutions are tuned to particular memory characteristics: lightweight solutions for low-cost embedded memories and latency-critical on-chip caches, and stronger protection for off-chip main memory subsystems.

We first target on-chip caches and scratchpads in IoT devices. We proposed FaultLink, a technique that tolerates hard faults in SRAM-based software-managed scratchpad memories, whether those faults are detected during testing, in the field, or when running at lower-than-nominal voltages. FaultLink optimally places sections of program code and data into fault-free segments of the memory address space. We extended FaultLink to approximation-tolerant deep learning inference applications, where program sections are placed in partially faulty memory segments based on how much error they can tolerate. For unpredictable single-bit soft errors that occur during runtime, we proposed two lightweight error correction techniques: Software Defined Error Localization Code (SDELC) and Parity++. SDELC first localizes single-bit errors (SBEs) to a specific chunk within the data and then heuristically recovers from these localized errors by exploiting observable side information about the application’s memory contents. Parity++ is a novel unequal message protection scheme that preferentially provides stronger error protection to certain “special messages”.
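
The localize-then-recover idea behind SDELC can be illustrated with a minimal sketch. This is not the SDELC code construction from the talk: it assumes one even-parity bit per 8-bit chunk of a 32-bit word, assumes the stored parity bits themselves are error-free, and uses a caller-supplied score function as a stand-in for the side information. All function names and the similarity-to-a-neighbor score are hypothetical.

```python
def chunk_parities(word, word_bits=32, chunk_bits=8):
    """Return one even-parity bit per chunk, lowest chunk first."""
    mask = (1 << chunk_bits) - 1
    return [bin((word >> shift) & mask).count("1") & 1
            for shift in range(0, word_bits, chunk_bits)]

def localize(received, stored_parities, word_bits=32, chunk_bits=8):
    """Return the index of the chunk whose parity mismatches, or None."""
    current = chunk_parities(received, word_bits, chunk_bits)
    bad = [i for i, (now, then) in enumerate(zip(current, stored_parities))
           if now != then]
    if len(bad) > 1:
        raise ValueError("more than one chunk mismatches; outside the single-bit model")
    return bad[0] if bad else None

def candidates(received, chunk, chunk_bits=8):
    """Enumerate every word obtained by flipping one bit in the flagged chunk."""
    start = chunk * chunk_bits
    return [received ^ (1 << bit) for bit in range(start, start + chunk_bits)]

def recover(received, stored_parities, score, word_bits=32, chunk_bits=8):
    """Pick the candidate that the (hypothetical) side-information score prefers."""
    chunk = localize(received, stored_parities, word_bits, chunk_bits)
    if chunk is None:
        return received  # no parity mismatch observed
    return max(candidates(received, chunk, chunk_bits), key=score)

# Usage: flip bit 13 of a word, then recover it.
original = 0x12345678
parities = chunk_parities(original)
corrupted = original ^ (1 << 13)
# Hypothetical side information: the word is expected to resemble a neighbor.
neighbor = 0x12345670
score = lambda w: -bin(w ^ neighbor).count("1")
repaired = recover(corrupted, parities, score)
```

Localization narrows the search to the eight candidates of one chunk, so the heuristic only has to rank a handful of possibilities. The recovery is a best guess, not a guaranteed correction.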

While these techniques work well for on-chip memories, reliability is a significant concern for off-chip memories as well. The single-bit error rate in DRAMs is steadily increasing, and emerging non-volatile memory (NVM) technologies suffer from high stochastic bit error rates. DRAM manufacturers are adopting on-die error correction coding (ECC) schemes, alongside ECC in the memory controller, to correct SBEs in the memory. However, we have shown that today’s standard on-die ECCs can lead to silent data corruption if not designed correctly. We proposed a collaborative on-die and in-controller error correction scheme that prevents silent data corruption and corrects 99.9997% of double-bit errors with no additional storage, latency, or area overhead. For NVMs, we proposed Compression with Multi-ECC (CME) for magnetic memories. We use compression to reduce the size of the memory lines. Then, based on the amount of compression achieved, we opportunistically use the saved space to pack in ECC bits for much stronger protection. We then focused on Phase Change Memories (PCM), where we proposed optimized PCM architectures that trade off capacity to reduce read latency and improve performance. Overall, with memory reliability being a major bottleneck in today’s systems, these novel solutions are expected to alleviate this problem, help cope with the unique outcomes of hardware variability in memory systems, and provide improved reliability at minimal cost.
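
The opportunistic step described above for CME (compress a line, then spend the freed space on stronger ECC) can be sketched as a simple selection rule. Everything specific here is an assumption made for illustration: the 64-byte line size, the use of zlib as a stand-in for whatever compressor the hardware would use, and the tier names and check-bit budgets. None of these are parameters from the talk.

```python
import zlib

LINE_BYTES = 64  # assumed memory line size, for illustration only

# Hypothetical ECC tiers: (label, extra check bits needed), strongest first.
ECC_TIERS = [("strong", 128), ("medium", 64), ("baseline", 0)]

def saved_bits(line: bytes) -> int:
    """Bits freed by compressing the line (0 if compression does not help)."""
    compressed = zlib.compress(line)  # software stand-in for a hardware compressor
    return max(0, (len(line) - len(compressed)) * 8)

def pick_ecc(free_bits: int, tiers=ECC_TIERS) -> str:
    """Return the strongest tier whose extra check bits fit in the freed space."""
    for label, needed in tiers:
        if needed <= free_bits:
            return label
    raise ValueError("no tier fits; tiers should include a zero-cost baseline")

# Usage: a highly compressible (all-zero) line leaves room for the strongest tier.
zero_line = bytes(LINE_BYTES)
tier_for_zero_line = pick_ecc(saved_bits(zero_line))
```

Lines that do not compress fall back to the zero-cost baseline tier, so protection is never weaker than the default and improves only when space happens to be available.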

Biography: Irina Alam received her B.Eng. from Nanyang Technological University, Singapore, in 2014 and her M.S. from the University of California, Los Angeles (UCLA), in 2018. She is currently working towards her Ph.D. degree in the Electrical and Computer Engineering department at UCLA. She worked as a Product Engineer at Micron Semiconductor Asia Pte. Ltd., Singapore, from 2014 to 2016. She received the UCLA EE Department Fellowship, two Best Paper Awards (ESWeek ’17 and SELSE ’18), the UCLA EE Outstanding MS Thesis in Circuits and Embedded Systems (2018), and the Cadence Women in Technology scholarship (2019). Her current research focuses on improving memory reliability and performance.

For more information, contact Prof. Puneet Gupta (puneet@ee.ucla.edu)

Date/Time:
Date(s) - Aug 06, 2021
1:00 pm - 3:00 pm

Location:
E-IV Tesla Room #53-125
420 Westwood Plaza - 5th Flr., Los Angeles CA 90095