Systematic Debugging Methods for Large Scale HPC Computational Frameworks

Abstract

Parallel computational frameworks for high performance computing (HPC) are central to the advancement of simulation-based studies in science and engineering. Unfortunately, finding and fixing bugs in these frameworks can be extremely time consuming. Left unchecked, these bugs can drastically diminish the amount of new science that can be performed. This paper presents our systematic study of the Uintah computational framework and our approaches to debugging it more incisively. Our key insight is to leverage the modular structure of Uintah, which lends itself to systematic debugging. In particular, we have developed a new approach based on Coalesced Stack Trace Graphs (CSTGs), which summarize system behavior in terms of key control flows manifested through function invocation chains. We illustrate several scenarios in which CSTGs can help localize bugs efficiently, and present a case study of how we found and fixed a real Uintah bug using CSTGs.
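As a rough illustration of the idea behind CSTGs, the sketch below merges a set of stack traces into a single graph of caller-to-callee edges with occurrence counts, then compares the graphs from two runs. This is a minimal sketch and not the paper's implementation: the function names `coalesce` and `diff`, the trace contents, and the choice to compare a working run against a failing one are all assumptions made for illustration.

```python
from collections import Counter

def coalesce(traces):
    """Merge stack traces (outermost call first) into caller->callee edge counts."""
    edges = Counter()
    for trace in traces:
        # Each adjacent pair of frames is one invocation edge.
        for caller, callee in zip(trace, trace[1:]):
            edges[(caller, callee)] += 1
    return edges

def diff(first, second):
    """Return edges whose counts differ, mapped to (count in first, count in second)."""
    return {
        edge: (first.get(edge, 0), second.get(edge, 0))
        for edge in set(first) | set(second)
        if first.get(edge, 0) != second.get(edge, 0)
    }

# Hypothetical traces; real ones would be collected at chosen points in the framework.
working = [["main", "schedule", "send"], ["main", "schedule", "recv"]]
failing = [["main", "schedule", "send"], ["main", "schedule", "send"]]

delta = diff(coalesce(working), coalesce(failing))
for edge, counts in sorted(delta.items()):
    print(edge, counts)
```

In this toy input the edges that differ point at the `recv` path, which is absent from the second run. Coalescing keeps the summary small even when many processes contribute similar traces, since repeated call chains only increase a count.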

Citation

Alan Humphrey, Qingyu Meng, Martin Berzins, Diego Caminha B. de Oliveira, Zvonimir Rakamaric, Ganesh Gopalakrishnan
Systematic Debugging Methods for Large Scale HPC Computational Frameworks
Computing in Science and Engineering (CiSE), 16(3): 48--56, 2014.

BibTeX

@article{2014_CiSE_hmbcrg,
  title = {Systematic Debugging Methods for Large Scale HPC Computational Frameworks},
  author = {Alan Humphrey and Qingyu Meng and Martin Berzins and Diego Caminha B. de Oliveira and Zvonimir Rakamaric and Ganesh Gopalakrishnan},
  journal = {Computing in Science and Engineering (CiSE)},
  publisher = {IEEE},
  volume = {16},
  number = {3},
  pages = {48--56},
  month = {May},
  year = {2014}
}