Practical Formal Correctness Checking of Million-core Problem Solving Environments for HPC

Share

Diego Caminha B. de Oliveira, Zvonimir Rakamaric, Ganesh Gopalakrishnan, Alan Humphrey, Qingyu Meng, Martin Berzins. 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE 2013), San Francisco, CA, USA.
[pdf] [bib]

Abstract: While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from light-weight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize off-site debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.

Bibtex:

@inproceedings{secse2013-orghmb,
  author = {Diego Caminha B. de Oliveira and Zvonimir Rakamari\'c
    and Ganesh Gopalakrishnan and Alan Humphrey and Qingyu Meng
    and Martin Berzins},
  title = {Practical Formal Correctness Checking of Million-core
    Problem Solving Environments for {HPC}},
  booktitle = {Proceedings of the 5th International Workshop
    on Software Engineering for Computational Science and Engineering
    (SE-CSE)},
  year = {2013},
  pages = {75--83},
}