FUSED: A Low-cost Online Soft-Error Detector

Vishal Chandra Sharma, Zvonimir Rakamarić, Ganesh Gopalakrishnan. 10th IEEE Workshop on Silicon Errors in Logic – System Effects (SELSE 2014), Palo Alto, CA, USA.
[pdf] [bib]

Abstract: The growth in soft error rates caused by shrinking device geometries and transistor variability can undermine system reliability, requiring cross-layer resilience solutions. In this paper, we make the following contributions to this area. First, we introduce a new framework called FUSED, in which soft-error detectors are automatically compiled from and inserted into application code through the ROSE compiler framework, which is widely used in HPC. Our error detectors are based on control-flow tracking through predicate transitions. Second, we develop a new heuristic based on the idea of invalid predicate transitions to identify sensitive code blocks that cause silent data corruption (SDC) in a program's execution output. New results reported in this paper include showing the feasibility of using likely program invariants (in the form of predicate transitions) in realistic code, automating error detector insertion, using our empirical findings to diagnose SDC-causing code blocks, and evaluating these techniques on a non-trivial scientific benchmark. Preliminary evaluation on the SuperLU scientific library, a direct solver for sparse, unsymmetric systems of linear equations, indicates that our detectors achieve an error detection rate of up to 90.5% while causing an average overhead of only 15.7% in application runtime.
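
To make the idea of detectors based on predicate transitions more concrete, here is a minimal Python sketch. It is not the FUSED implementation, which inserts detectors into application code through the compiler. The class `PredicateTransitionDetector`, the helper `count_below`, and the predicate id `"below_threshold"` are all hypothetical names introduced for illustration. The sketch assumes that a "predicate transition" is the pair (previous outcome, current outcome) of one branch predicate, that the allowed transitions are learned from fault-free runs as likely invariants, and that any transition not seen during learning is reported as invalid.

```python
from collections import defaultdict


class PredicateTransitionDetector:
    """Toy detector (hypothetical, not the FUSED code): learns which outcome
    transitions each branch predicate shows in fault-free runs, then flags
    transitions that were never seen."""

    def __init__(self):
        self.allowed = defaultdict(set)  # predicate id -> {(prev, curr), ...}
        self.last = {}                   # predicate id -> last observed outcome
        self.training = True             # True while learning likely invariants
        self.violations = []             # (predicate id, transition) pairs

    def start_run(self):
        # Forget the previous outcomes so transitions never span two runs.
        self.last.clear()

    def observe(self, pred_id, outcome):
        """Record one evaluation of a predicate and return its outcome, so the
        call can wrap the condition of an existing branch."""
        outcome = bool(outcome)
        prev = self.last.get(pred_id)
        self.last[pred_id] = outcome
        if prev is None:
            return outcome  # first evaluation in this run: no transition yet
        transition = (prev, outcome)
        if self.training:
            self.allowed[pred_id].add(transition)
        elif transition not in self.allowed[pred_id]:
            self.violations.append((pred_id, transition))
        return outcome


def count_below(det, values, threshold):
    """Instrumented kernel: counts the values below a threshold."""
    count = 0
    for v in values:
        if det.observe("below_threshold", v < threshold):
            count += 1
    return count


det = PredicateTransitionDetector()

# Learning phase on sorted input: the predicate only ever goes True -> False.
det.start_run()
count_below(det, [1, 2, 3, 8, 9], threshold=5)

# Detection phase, fault-free run.
det.training = False
det.start_run()
clean_count = count_below(det, [0, 1, 4, 6, 7], threshold=5)
clean_violations = list(det.violations)

# Detection phase with a simulated bit flip: the second value 1 becomes 1 ^ 8 = 9.
det.start_run()
count_below(det, [0, 1 ^ 8, 4, 6, 7], threshold=5)
print(det.violations)
```

In this sketch the corrupted value makes the predicate go from False back to True, a transition that the learning run never produced, so it is recorded in `det.violations`. A transition-based check of this kind only catches faults that change the outcome of a predicate; a corrupted value that leaves every branch outcome unchanged passes unnoticed.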

BibTeX:

@inproceedings{selse2014-srg,
  author = {Vishal Chandra Sharma and Zvonimir Rakamari\'c and Ganesh Gopalakrishnan},
  title = {{FUSED}: A Low-cost Online Soft-Error Detector},
  booktitle = {10th IEEE Workshop on Silicon Errors in Logic---System Effects (SELSE)},
  year = {2014},
  note = {Poster paper},
}