EE PhD Student Duo Wins the 2016 Qualcomm Innovation Fellowship to Improve System Resilience Against Memory Errors

Mark Gottscho and Clayton Schoeny form one of just eight teams nationwide to win the $100,000 Qualcomm Innovation Fellowship this year. The two UCLA PhD candidates in Electrical Engineering also completed both their BS and MS degrees at the university.

When Mark Gottscho and Clayton Schoeny started college at UCLA almost ten years ago, possibly sitting in the same lecture hall on their first day, they didn’t know each other. Nor could they have known that they would stay on after the four-year bachelor’s program to earn their master’s degrees. Today, they are working side by side to complete their PhDs in electrical engineering.

The pair have won the highly competitive 2016 Qualcomm Innovation Fellowship for their research proposal on so-called “Software-Defined Error-Correcting Codes,” a new way to recover from errors in computer memories. The elite fellowship selected eight two-student research teams from a pool of 129 applications, which were limited to the top 18 electrical engineering and computer science graduate programs across the country.

“If you look at the list of past and current winners and finalists, [the fellowship’s] pretty competitive and there are some amazing research groups,” Gottscho said. “So we are especially surprised and flattered that we won, really.”

Gottscho studies computer architecture and is advised by Prof. Puneet Gupta, while Schoeny studies information theory under the guidance of Prof. Lara Dolecek. Schoeny said he and Gottscho never really had the chance to work together before this project, apart from the few times their groups crossed paths.

“Both of our groups study errors in computing, but from totally different angles,” Schoeny said. “We decided to come together to brainstorm and try to take advantage of our different skill sets. And then just over the course of a few months, without a specific goal in mind for collaboration, [the idea] just organically came out from both sides.”

As semiconductor manufacturing technology advances, the tiny transistors in a computer are made smaller and smaller, which exposes the memory to more and more errors. According to Schoeny, the error-correcting codes (ECCs) in common use are all based on mathematical foundations from the 1950s. The purpose of an ECC is to detect and correct erroneous bit flips. For instance, in common electronics, single-bit errors are correctable. A double-bit error, however, cannot be corrected by these codes.
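To make the single-bit versus double-bit distinction concrete, here is a minimal sketch of a textbook extended Hamming (8,4) code, which corrects any single bit flip and detects, but cannot correct, any double flip. It is a toy chosen for illustration and is not the code used in any particular memory system or in the researchers' work; the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]    # parity over positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]    # parity over positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]    # parity over positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2              # makes the total number of 1s even
    return code


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos              # XOR of the positions holding a 1
    odd_parity = sum(word) % 2           # 1 means an odd number of bits flipped

    if syndrome == 0 and odd_parity == 0:
        status = "ok"
    elif odd_parity == 1:
        # Single flip: the syndrome names the bad position (0 = overall parity bit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped, position unknown.
        return "uncorrectable", None

    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

Flipping any one bit of `encode(n)` makes `decode` return `("corrected", n)`, while flipping any two bits makes it return `("uncorrectable", None)`, which is the point at which a conventional system would give up.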

“If an error is detected but uncorrectable, the system just gives up and says, ‘Look, an error happened. I don’t want to corrupt anything by continuing, so let’s crash instead,’” Gottscho said. “But if you imagine this happening in your autonomous car or an airplane, that’s not a good way to do it.”

Schoeny and Gottscho’s proposed solution to the problem resides in the use of “side information,” or hints, about application data stored in the memory device.

“When an error happens, you need to figure out what the original data was. And it turns out, through some analysis that we did, that there’s a small number of possibilities the original data could have been,” said Gottscho. “And then all you have to do is choose amongst those possibilities — which one is the most likely given what we know about the program?”

To achieve that, Schoeny and Gottscho’s proposed Software-Defined ECC scheme needs to learn about the semantics of how data is stored and used by software. This can help the system recover the original data.
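The idea of choosing among a few possible originals can be sketched with a toy code. The example below assumes a textbook extended Hamming (8,4) code and a made-up piece of side information (that the program mostly stores small values); neither is taken from the researchers' proposal, and the names `candidates` and `recover` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits (0..15) into an 8-bit extended Hamming codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return tuple(code)


# Every valid codeword of the toy code, mapped back to its data value.
CODEBOOK = {encode(n): n for n in range(16)}


def hamming_distance(a, b):
    """Number of bit positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))


def candidates(received):
    """Data values whose codewords are exactly two bit flips from `received`."""
    return sorted(n for cw, n in CODEBOOK.items()
                  if hamming_distance(cw, received) == 2)


def recover(received, likelihood):
    """Pick the candidate that the (assumed) side information rates most likely."""
    return max(candidates(received), key=likelihood)


# Store the value 3, then corrupt two bits of its codeword.
original = 3
word = list(encode(original))
word[1] ^= 1
word[6] ^= 1

options = candidates(word)
# Hypothetical side information: smaller values are considered more likely.
guess = recover(word, likelihood=lambda n: -n)
```

For this toy code, a word with two flipped bits always sits two flips away from exactly four valid codewords, so `options` has four entries, one of which is the original. With the flips chosen here the candidates are 3, 5, 6 and 15, and the assumed preference for small values picks 3. A real system would need far richer side information than this one-line heuristic.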

The new approach will require design expertise in hardware, software, and the error-correcting code itself, which the UCLA duo is ideally suited for. If their approach succeeds, it will have a significant impact on safety-critical systems. It could make self-driving cars safer by reducing the chance of having to stop and completely restart the car electronics whenever an error is detected.

Another application area is supercomputers, where the sheer amount of hardware dedicated to solving a scientific problem makes the machine extremely susceptible to memory errors.

“Researchers have been warning that memory errors are a major obstacle to the development of exascale computers. A future supercomputer could have uncorrectable memory errors very often, on the scale of hours or even minutes,” Gottscho said. “And if you don’t have a way of dealing with it, you can’t afford to crash your megawatt-level supercomputer that often. You’ll never get anything done.”

Today’s supercomputers accept the inevitable crash, but try to save progress by checkpointing. Much like a video game, the computer regularly saves its progress so that whenever it crashes from a memory error, it can roll back to the last checkpoint and restart execution from there.
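The checkpoint-and-rollback pattern can be sketched in a few lines. This is a simplified simulation under assumed parameters, not a model of any real supercomputer: the function `run_with_checkpoints` and its arguments are invented for this example, and a "crash" is simply an attempt number we designate in advance.

```python
def run_with_checkpoints(n_steps, interval, crash_on_attempts=()):
    """Sum 0..n_steps-1 one step at a time, checkpointing every `interval` steps.

    `crash_on_attempts` lists the attempt numbers (counted from 1) at which a
    simulated uncorrectable memory error occurs. Returns (total, attempts).
    """
    state = {"step": 0, "total": 0}
    checkpoint = dict(state)             # saved copy to roll back to
    crashes = set(crash_on_attempts)
    attempts = 0

    while state["step"] < n_steps:
        attempts += 1
        if attempts in crashes:
            state = dict(checkpoint)     # lose everything since the last save
            continue
        state["total"] += state["step"]  # one unit of "useful work"
        state["step"] += 1
        if state["step"] % interval == 0:
            checkpoint = dict(state)     # regular save, needed even without errors

    return state["total"], attempts


clean = run_with_checkpoints(10, interval=4)
crashed = run_with_checkpoints(10, interval=4, crash_on_attempts=[7])
```

Both runs produce the same total of 45, but the run with a single simulated error needs 13 attempts instead of 10: the failed attempt itself plus two steps that had to be redone after rolling back to the checkpoint at step 4.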

“But that’s expensive. Because even if you don’t get errors, you have to keep backing up. And that takes a lot of energy and time,” Gottscho said. “And when an error does happen, you roll back to the last save point, which is also bad, because you end up spending less time doing any useful work.”

Schoeny and Gottscho’s proposed Software-Defined ECC, on the other hand, could reduce the likelihood of having to roll back after an error. Therefore, future supercomputers could be much more scalable.

The pair first revealed their idea at Qualcomm’s fellowship finals in San Diego during late March, along with 33 other pairs of finalists. A week later, their preliminary work outlining the concept won a Best Paper award at the IEEE Silicon Errors in Logic — System Effects workshop in Austin, Texas. The paper will be presented again in a special session at the IEEE/IFIP International Conference on Dependable Systems and Networks at the end of June.

The Qualcomm fellowship will provide Schoeny and Gottscho with a combined $100,000 for the 2016-2017 academic year, along with mentorship from the company’s researchers and engineers.