Towards Formal Approaches to System Resilience

Vishal Chandra Sharma, Arvind Haran, Zvonimir Rakamaric, Ganesh Gopalakrishnan. 19th IEEE Pacific Rim International Symposium on Dependable Computing (PRDC 2013), Vancouver, Canada.
[pdf] [bib]

Abstract: Technology scaling and techniques such as dynamic voltage/frequency scaling are predicted to increase the number of transient faults in future processors. Error detectors implemented in hardware are often energy inefficient, as they are “always on.” While software-level error detection can augment hardware-level detectors, creating detectors in software that are highly effective remains a challenge. In this paper, we first present a new LLVM-level fault injector called KULFI that helps simulate faults occurring within CPU state elements in a versatile manner. Second, using KULFI, we study the behavior of a family of well-known and simple algorithms under error injection. (We choose a family of sorting algorithms for this study.) We then propose a promising way to interpret our empirical results using a formal model that builds on the idea of predicate state transition diagrams. After introducing the basic abstraction underlying our predicate transition diagrams, we draw connections to the level of resilience empirically observed during fault injection studies. Building on the observed connections, we develop a simple, and yet effective, predicate-abstraction-based fault detector. While in its initial stages, ours is believed to be the first study that offers a formal way to interpret and compare fault injection results obtained from algorithms from within one family. Given the absolutely unpredictable nature of what a fault can do to a computation in general, our approach may help designers choose amongst a class of algorithms one that behaves most resilient of all.


  author = {Vishal Chandra Sharma and Arvind Haran and Zvonimir Rakamari\'c and
    Ganesh Gopalakrishnan},
  title = {Towards Formal Approaches to System Resilience},
  booktitle = {Proceedings of the 19th IEEE Pacific Rim International Symposium on
    Dependable Computing (PRDC)},
  year = {2013},
  pages = {41--50},